[ https://issues.apache.org/jira/browse/HBASE-25709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552935#comment-17552935 ]
Viraj Jasani commented on HBASE-25709:
--------------------------------------

Thanks [~Xiaolin Ha]. I did some more digging, and I do see discrepancies in Scan results depending on the cells-scanned interval per heartbeat check (StoreScanner.HBASE_CELLS_SCANNED_PER_HEARTBEAT_CHECK). With the patch below applied, the test passes, whereas it should fail. The patch deletes the column qualifier q8 after the first Scan and before incrementing the time:

{code:java}
diff --git a/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java b/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
index d41916ae3b..6736e12ad9 100644
--- a/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
+++ b/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
@@ -6955,6 +6955,7 @@ public class TestHRegion {
     // A query at time T+0 should return all cells
     checkScan(8);
+    region.delete(new Delete(row).addColumn(fam1, q8));
     // Increment time to T+ttlSecs seconds
     edge.incrementTime(ttlSecs * 1000);
{code}

With this patch, the test should fail at *_checkScan(3)_*, because the scan should return only 2 results instead of 3. However, if I apply the following patch instead, which also raises the cells scanned per heartbeat check from 2 to 20000, the test fails as expected:

{code:java}
diff --git a/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java b/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
index d41916ae3b..a52eaecc48 100644
--- a/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
+++ b/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
@@ -6928,7 +6928,7 @@ public class TestHRegion {
     Configuration conf = new Configuration(TEST_UTIL.getConfiguration());
     conf.setInt(HFile.FORMAT_VERSION_KEY, HFile.MIN_FORMAT_VERSION_WITH_TAGS);
     // using small heart beat cells
-    conf.setLong(StoreScanner.HBASE_CELLS_SCANNED_PER_HEARTBEAT_CHECK, 2);
+    conf.setLong(StoreScanner.HBASE_CELLS_SCANNED_PER_HEARTBEAT_CHECK, 20000);
     region = HBaseTestingUtility.createRegionAndWAL(
       RegionInfoBuilder.newBuilder(tableDescriptor.getTableName()).build(),
@@ -6955,6 +6955,7 @@ public class TestHRegion {
     // A query at time T+0 should return all cells
     checkScan(8);
+    region.delete(new Delete(row).addColumn(fam1, q8));
     // Increment time to T+ttlSecs seconds
     edge.incrementTime(ttlSecs * 1000);
{code}

Test failure logs:

{code:java}
java.lang.AssertionError:
Expected :3
Actual   :2
    at org.junit.Assert.fail(Assert.java:89)
    at org.junit.Assert.failNotEquals(Assert.java:835)
    at org.junit.Assert.assertEquals(Assert.java:647)
    at org.junit.Assert.assertEquals(Assert.java:633)
    at org.apache.hadoop.hbase.regionserver.TestHRegion.checkScan(TestHRegion.java:6972)
    at org.apache.hadoop.hbase.regionserver.TestHRegion.testTTLsUsingSmallHeartBeatCells(TestHRegion.java:6962)
{code}

This failure is exactly the expected behaviour, but it occurs only if we raise the cells scanned per heartbeat check to a very high number. With the small value of 2, the deleted cell still shows up in the Scan result. We should fix this.

FYI [~apurtell] for upcoming 2.5 and 2.4 releases.
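
For anyone following along who has not looked at this setting before, here is a rough, self-contained sketch of what a "cells scanned per heartbeat check" interval controls. It is a toy model under my own assumptions, not the StoreScanner code: the class HeartbeatCheckSketch, its fields, and the stopRequested callback are all hypothetical names. It only illustrates that a scanner which skips expired cells hands control back to its caller at interval boundaries, and it does not attempt to reproduce or explain the discrepancy reported above.

```java
import java.util.Arrays;
import java.util.function.BooleanSupplier;

/** Toy model only; none of these names are real HBase classes or methods. */
final class HeartbeatCheckSketch {
  private final boolean[] expired;   // expired[i] == true means cell i is skipped
  private final long cellsPerCheck;  // hypothetical stand-in for the configured interval
  private int position = 0;

  public HeartbeatCheckSketch(boolean[] expired, long cellsPerCheck) {
    this.expired = expired;
    this.cellsPerCheck = cellsPerCheck;
  }

  /**
   * Scans forward, skipping expired cells, and returns the number of live cells seen.
   * The stop condition is consulted only after every cellsPerCheck scanned cells,
   * and skipped cells count toward that interval.
   */
  public int next(BooleanSupplier stopRequested) {
    int live = 0;
    long scanned = 0;
    while (position < expired.length) {
      if (!expired[position]) {
        live++;
      }
      position++;
      scanned++;
      if (scanned % cellsPerCheck == 0 && stopRequested.getAsBoolean()) {
        break; // return early so the caller can react, e.g. check a stop marker
      }
    }
    return live;
  }

  public int position() {
    return position;
  }

  public static void main(String[] args) {
    boolean[] allExpired = new boolean[10];
    Arrays.fill(allExpired, true);

    HeartbeatCheckSketch small = new HeartbeatCheckSketch(allExpired, 2);
    small.next(() -> true);
    System.out.println("interval=2 stopped at " + small.position());     // prints 2

    HeartbeatCheckSketch large = new HeartbeatCheckSketch(allExpired, 20000);
    large.next(() -> true);
    System.out.println("interval=20000 stopped at " + large.position()); // prints 10
  }
}
```

In this model, a small interval makes a single next() call return after only a couple of cells even when every cell is skipped, while a very large interval means the call never consults the stop condition before reaching the end of the data.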
FYI [~kadir] [~gjacoby] [~tkhurana]

> Close region may stuck when region is compacting and skipped most cells read
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-25709
>                 URL: https://issues.apache.org/jira/browse/HBASE-25709
>             Project: HBase
>          Issue Type: Bug
>          Components: Compaction
>    Affects Versions: 1.7.1, 3.0.0-alpha-2, 2.4.10
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>             Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.11
>
>         Attachments: Master-UI-RIT.png, RS-region-state.png
>
>
> In our cluster, we found that stopping (closing) a region can get stuck. The
> region was compacting, and its store files had many TTL-expired cells. The
> close-region state marker (HRegion#writestate.writesEnabled) is not checked
> during compaction, because most cells were skipped.
> !RS-region-state.png|width=698,height=310!
>
> !Master-UI-RIT.png|width=693,height=157!
>
> HBASE-23968 encountered a similar problem, but its solution sits outside the
> method InternalScanner#next(List<Cell> result, ScannerContext scannerContext),
> which, with the current compaction scanner context, does not return if there
> are many skipped cells. As a result, we need to return in time from the next
> method and then check the stop marker.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)