[ 
https://issues.apache.org/jira/browse/HBASE-25709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553539#comment-17553539
 ] 

Xiaolin Ha commented on HBASE-25709:
------------------------------------

Thanks for the excellent digging, [~vjasani].

The problem here is that, after the scanner returns early at the heartbeat 
check, the matcher is unexpectedly reset on the next call,
{code:java}
// If no limits exists in the scope LimitScope.Between_Cells then we are sure we are changing
// rows. Else it is possible we are still traversing the same row so we must perform the row
// comparison.
if (!scannerContext.hasAnyLimit(LimitScope.BETWEEN_CELLS) || matcher.currentRow() == null) {
  this.countPerRow = 0;
  matcher.setToNewRow(cell);
} {code}
so all the delete markers tracked for the row are cleared, and the following 
cells of the same row are wrongly matched.

In the test case you provided, 
StoreScanner.HBASE_CELLS_SCANNED_PER_HEARTBEAT_CHECK=2 and the intermediate 
results are [q4,q5]. The heartbeat count is reached right after [q8/delete], so 
[q8/put] is wrongly added to the results because the matcher no longer has 
[q8/delete].
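
To make the failure mode concrete, here is a toy reproduction. It is only a sketch and not the real StoreScanner or ScanQueryMatcher: the class HeartbeatResetDemo, the ToyCell type, and the exact cell layout (including the q6 delete used to make the counts line up) are assumptions made for illustration. It shows how clearing the tracked deletes at every heartbeat boundary lets a deleted put leak into the results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HeartbeatResetDemo {
  // Toy cell: a qualifier plus a flag saying whether it is a delete marker.
  static final class ToyCell {
    final String qualifier;
    final boolean isDelete;

    ToyCell(String qualifier, boolean isDelete) {
      this.qualifier = qualifier;
      this.isDelete = isDelete;
    }
  }

  // One row, in scan order. The layout is hypothetical.
  static List<ToyCell> sampleRow() {
    return Arrays.asList(
        new ToyCell("q4", false),
        new ToyCell("q5", false),
        new ToyCell("q6", true),
        new ToyCell("q8", true),
        new ToyCell("q8", false));
  }

  // Scans a single row. Every cellsPerHeartbeat cells the scan "returns early";
  // clearing the delete set mimics the unconditional matcher reset on re-entry.
  static List<String> scanRow(List<ToyCell> row, int cellsPerHeartbeat) {
    List<String> results = new ArrayList<>();
    Set<String> deletes = new HashSet<>();
    int scanned = 0;
    for (ToyCell cell : row) {
      if (cell.isDelete) {
        deletes.add(cell.qualifier);
      } else if (!deletes.contains(cell.qualifier)) {
        results.add(cell.qualifier);
      }
      scanned++;
      if (scanned % cellsPerHeartbeat == 0) {
        deletes.clear(); // same row, but the tracked deletes are lost
      }
    }
    return results;
  }

  public static void main(String[] args) {
    System.out.println(scanRow(sampleRow(), 2));   // prints [q4, q5, q8]
    System.out.println(scanRow(sampleRow(), 100)); // prints [q4, q5]
  }
}
```

With a heartbeat interval of 2 the second boundary falls right after the q8 delete, so the following q8 put is no longer masked.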

I think for this issue we can use HRegion#checkInterrupt to avoid getting 
stuck when closing the region, as in the code here,
{code:java}
         // when reaching the heartbeat cells, try to return from the loop.
         if (kvsScanned % cellsPerHeartbeatCheck == 0) {
-          return scannerContext.setScannerState(NextState.MORE_VALUES).hasMoreValues();
+          this.store.getHRegion().checkInterrupt();
         } {code}
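
A rough sketch of the idea behind this first option, under loud assumptions: the checkInterrupt method below is a hypothetical stand-in that only looks at the thread interrupt flag, not the real HRegion#checkInterrupt, and InterruptCheckDemo and skipCells are made-up names. The point is that the loop never returns early, so no matcher state is re-initialised, yet a close request is still noticed within cellsPerHeartbeatCheck cells.

```java
import java.io.InterruptedIOException;

class InterruptCheckDemo {
  // Hypothetical stand-in for HRegion#checkInterrupt: fail fast if the
  // scanning thread has been interrupted (for example by a region close).
  static void checkInterrupt() throws InterruptedIOException {
    if (Thread.interrupted()) { // also clears the interrupt flag
      throw new InterruptedIOException("scan interrupted");
    }
  }

  // Skips up to totalCells cells and returns how many were scanned before
  // the loop finished or was abandoned because of an interrupt.
  static int skipCells(int totalCells, int cellsPerHeartbeatCheck) {
    int kvsScanned = 0;
    try {
      while (kvsScanned < totalCells) {
        kvsScanned++; // pretend this cell was expired and skipped
        if (kvsScanned % cellsPerHeartbeatCheck == 0) {
          checkInterrupt(); // no early return, so matcher state is untouched
        }
      }
    } catch (InterruptedIOException e) {
      // The scan is abandoned here; a real caller would propagate the exception.
    }
    return kvsScanned;
  }

  public static void main(String[] args) {
    Thread.currentThread().interrupt();   // simulate a close request
    System.out.println(skipCells(10, 2)); // prints 2
    System.out.println(skipCells(10, 2)); // prints 10 (flag was cleared)
  }
}
```
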
But we can also fix the matcher issue, because I think the matcher reset should 
only happen when the scanner reaches a new row. Keeping the early return at the 
heartbeat check also lets user scanners return as soon as possible when they 
encounter mass deletions. The fix is,
{code:java}
@@ -276,9 +278,12 @@ public abstract class ScanQueryMatcher implements ShipperListener {
    * @param currentRow
    */
   public void setToNewRow(Cell currentRow) {
-    this.currentRow = currentRow;
-    columns.reset();
-    reset();
+    if (this.currentRow == null
+      || this.rowComparator.compareRows(currentRow, this.currentRow) != 0) {
+      this.currentRow = currentRow;
+      columns.reset();
+      reset();
+    }
   } {code}
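
The effect of a guarded reset can be sketched with a toy matcher. Everything here (GuardedResetDemo, ToyMatcher, ToyCell, the string row keys and the cell layout) is hypothetical and much simpler than ScanQueryMatcher; it only illustrates that calling setToNewRow repeatedly for the same row keeps the tracked deletes, while a real row change still clears them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class GuardedResetDemo {
  static final class ToyCell {
    final String row;
    final String qualifier;
    final boolean isDelete;

    ToyCell(String row, String qualifier, boolean isDelete) {
      this.row = row;
      this.qualifier = qualifier;
      this.isDelete = isDelete;
    }
  }

  static final class ToyMatcher {
    private String currentRow;
    private final Set<String> deletes = new HashSet<>();

    // Reset only when the row actually changes.
    void setToNewRow(String row) {
      if (currentRow == null || !currentRow.equals(row)) {
        currentRow = row;
        deletes.clear();
      }
    }

    // Records delete markers and reports whether a put should be returned.
    boolean include(ToyCell cell) {
      if (cell.isDelete) {
        deletes.add(cell.qualifier);
        return false;
      }
      return !deletes.contains(cell.qualifier);
    }
  }

  // Hypothetical layout: row r1 deletes q8, row r2 has a live q8.
  static List<ToyCell> sampleCells() {
    return Arrays.asList(
        new ToyCell("r1", "q4", false),
        new ToyCell("r1", "q5", false),
        new ToyCell("r1", "q8", true),
        new ToyCell("r1", "q8", false),
        new ToyCell("r2", "q8", false));
  }

  // Worst case for re-entry: setToNewRow is called before every single cell.
  static List<String> scan(List<ToyCell> cells) {
    ToyMatcher matcher = new ToyMatcher();
    List<String> results = new ArrayList<>();
    for (ToyCell cell : cells) {
      matcher.setToNewRow(cell.row);
      if (matcher.include(cell)) {
        results.add(cell.row + "/" + cell.qualifier);
      }
    }
    return results;
  }

  public static void main(String[] args) {
    System.out.println(scan(sampleCells())); // prints [r1/q4, r1/q5, r2/q8]
  }
}
```
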
I prefer the second solution. What do you think, [~vjasani] [~apurtell]?

 

 

> Close region may stuck when region is compacting and skipped most cells read
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-25709
>                 URL: https://issues.apache.org/jira/browse/HBASE-25709
>             Project: HBase
>          Issue Type: Bug
>          Components: Compaction
>    Affects Versions: 1.7.1, 3.0.0-alpha-2, 2.4.10
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>             Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.11
>
>         Attachments: Master-UI-RIT.png, RS-region-state.png
>
>
> We found a region stuck in closing in our cluster. The region was compacting, 
> and its store files had many TTL-expired cells. The close-region state 
> marker (HRegion#writestate.writesEnabled) was not checked during compaction 
> because most cells were skipped. 
> !RS-region-state.png|width=698,height=310!
>  
> !Master-UI-RIT.png|width=693,height=157!
>  
> HBASE-23968 encountered a similar problem, but its fix is outside the method
> InternalScanner#next(List<Cell> result, ScannerContext scannerContext), which 
> does not return while many cells are being skipped under the current 
> compaction scanner context. As a result, we need to return in time from the 
> next method and then check the stop marker.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
