[jira] [Commented] (HBASE-25709) Close region may stuck when region is compacting and skipped most cells read

Viraj Jasani (Jira) Fri, 10 Jun 2022 12:17:06 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-25709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552935#comment-17552935
 ]


Viraj Jasani commented on HBASE-25709:
--------------------------------------

Thanks [~Xiaolin Ha].

I did some more digging and I do see discrepancies in Scan results depending on 
the value of cells scanned per heartbeat check.

 

If you apply this patch, the test passes whereas it should have failed. The 
patch deletes one column qualifier (q8) after the first Scan and before 
incrementing the time:
{code:java}
diff --git a/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java b/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
index d41916ae3b..6736e12ad9 100644
--- a/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
+++ b/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
@@ -6955,6 +6955,7 @@ public class TestHRegion {
 
     // A query at time T+0 should return all cells
     checkScan(8);
+    region.delete(new Delete(row).addColumn(fam1, q8));
 
     // Increment time to T+ttlSecs seconds
     edge.incrementTime(ttlSecs * 1000); {code}
With this patch, the test should fail at *_checkScan(3)_*, because the scan 
should now return only 2 results instead of 3.

 

However, if I apply this patch, which also increases cells scanned per 
heartbeat check, the test fails as expected:
{code:java}
diff --git a/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java b/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
index d41916ae3b..a52eaecc48 100644
--- a/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
+++ b/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
@@ -6928,7 +6928,7 @@ public class TestHRegion {
     Configuration conf = new Configuration(TEST_UTIL.getConfiguration());
     conf.setInt(HFile.FORMAT_VERSION_KEY, HFile.MIN_FORMAT_VERSION_WITH_TAGS);
     // using small heart beat cells
-    conf.setLong(StoreScanner.HBASE_CELLS_SCANNED_PER_HEARTBEAT_CHECK, 2);
+    conf.setLong(StoreScanner.HBASE_CELLS_SCANNED_PER_HEARTBEAT_CHECK, 20000);
 
     region = HBaseTestingUtility.createRegionAndWAL(
       RegionInfoBuilder.newBuilder(tableDescriptor.getTableName()).build(),
@@ -6955,6 +6955,7 @@ public class TestHRegion {
 
     // A query at time T+0 should return all cells
     checkScan(8);
+    region.delete(new Delete(row).addColumn(fam1, q8));
 
     // Increment time to T+ttlSecs seconds
     edge.incrementTime(ttlSecs * 1000); {code}
Test failure logs:
{code:java}
java.lang.AssertionError: 
Expected :3
Actual   :2
    at org.junit.Assert.fail(Assert.java:89)
    at org.junit.Assert.failNotEquals(Assert.java:835)
    at org.junit.Assert.assertEquals(Assert.java:647)
    at org.junit.Assert.assertEquals(Assert.java:633)
    at org.apache.hadoop.hbase.regionserver.TestHRegion.checkScan(TestHRegion.java:6972)
    at org.apache.hadoop.hbase.regionserver.TestHRegion.testTTLsUsingSmallHeartBeatCells(TestHRegion.java:6962)
 {code}
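This failure is exactly the expected behaviour, but it occurs only when cells 
scanned per heartbeat check is raised to a very high number. With the small 
value (2), the deleted cell is still counted and the test passes incorrectly.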

We should fix this.
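
The property at stake can be shown with a toy model outside HBase. This is only a sketch under stated assumptions: ToyScanner, scanAll and the int-array cell encoding are hypothetical names invented for illustration, and none of this is HBase code. It shows that a scanner which pauses every N examined cells and is then resumed should return the same visible cells for any N.

```java
import java.util.ArrayList;
import java.util.List;

public class HeartbeatIntervalSketch {

    /** Toy scanner (hypothetical): a value below 0 marks a cell to skip (expired or deleted). */
    static final class ToyScanner {
        private final int[] cells;
        private final int checkInterval;
        private int pos = 0;

        ToyScanner(int[] cells, int checkInterval) {
            this.cells = cells;
            this.checkInterval = checkInterval;
        }

        /** Appends visible cells to out; returns true if the caller should call again. */
        boolean next(List<Integer> out) {
            int examined = 0;
            while (pos < cells.length) {
                int cell = cells[pos++];
                if (cell >= 0) {
                    out.add(cell);
                }
                // Count every examined cell, skipped or not, towards the pause.
                if (++examined >= checkInterval) {
                    return pos < cells.length;
                }
            }
            return false;
        }
    }

    /** Drives the scanner to completion and collects everything it returned. */
    static List<Integer> scanAll(int[] cells, int checkInterval) {
        ToyScanner scanner = new ToyScanner(cells, checkInterval);
        List<Integer> results = new ArrayList<>();
        while (scanner.next(results)) {
            // A real client would see a heartbeat here and simply ask again.
        }
        return results;
    }
}
```

In this model scanAll(cells, 2) and scanAll(cells, 20000) return the same list, which is the behaviour the test above expects from the real scanner.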

 

FYI [~apurtell] for upcoming 2.5 and 2.4 releases.

 

FYI [~kadir] [~gjacoby] [~tkhurana] 

> Close region may stuck when region is compacting and skipped most cells read
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-25709
>                 URL: https://issues.apache.org/jira/browse/HBASE-25709
>             Project: HBase
>          Issue Type: Bug
>          Components: Compaction
>    Affects Versions: 1.7.1, 3.0.0-alpha-2, 2.4.10
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>             Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.11
>
>         Attachments: Master-UI-RIT.png, RS-region-state.png
>
>
> We found a region stuck while closing on our cluster. The region is compacting, 
> and its store files have many TTL-expired cells. The close-region state 
> marker (HRegion#writestate.writesEnabled) is not checked during compaction, 
> because most cells were skipped. 
> !RS-region-state.png|width=698,height=310!
>  
> !Master-UI-RIT.png|width=693,height=157!
>  
> HBASE-23968 encountered a similar problem, but its solution sits outside 
> the method
> InternalScanner#next(List<Cell> result, ScannerContext scannerContext), which 
> will not return if there are many skipped cells, for the current compaction 
> scanner context. As a result, we need to return in time from the next method, 
> and then check the stop marker.
>  
>  
>  
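
The mechanism in the quoted description can be sketched with a small self-contained model. The assumptions are loud ones: compact, checkInterval and countSkipped are hypothetical names, only the writesEnabled flag name is taken from the description, and this is not the actual HBase change. The sketch contrasts a loop that counts only emitted cells towards the stop check with one that also counts skipped cells.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class CompactionStopCheckSketch {

    /**
     * Walks over cells (true = TTL-expired, so skipped) and returns how many
     * cells were examined before the loop ended or saw the stop request.
     * countSkipped selects whether skipped cells count towards the check.
     */
    static int compact(boolean[] expired, AtomicBoolean writesEnabled,
                       int checkInterval, boolean countSkipped) {
        int examined = 0;
        int sinceCheck = 0;
        for (boolean isExpired : expired) {
            examined++;
            if (!isExpired || countSkipped) {
                sinceCheck++;
            }
            if (sinceCheck >= checkInterval) {
                sinceCheck = 0;
                if (!writesEnabled.get()) {
                    return examined; // region is closing: stop compacting
                }
            }
        }
        return examined;
    }
}
```

With every cell expired and writesEnabled already false, the variant that ignores skipped cells runs through the whole input without ever checking the flag, while the variant that counts them stops after checkInterval cells.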



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
