[
https://issues.apache.org/jira/browse/HBASE-25709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553539#comment-17553539
]
Xiaolin Ha commented on HBASE-25709:
------------------------------------
Thanks for the excellent digging, [~vjasani].
The problem here is that, after the scanner returns early at the heartbeat
cell count, the matcher is unexpectedly reset on the next call:
{code:java}
// If no limits exists in the scope LimitScope.Between_Cells then we are sure we are changing
// rows. Else it is possible we are still traversing the same row so we must perform the row
// comparison.
if (!scannerContext.hasAnyLimit(LimitScope.BETWEEN_CELLS) || matcher.currentRow() == null) {
  this.countPerRow = 0;
  matcher.setToNewRow(cell);
} {code}
All the delete markers of the row are then cleared, and the following cells of
the same row are wrongly matched.
As in the test case you provided, when
StoreScanner.HBASE_CELLS_SCANNED_PER_HEARTBEAT_CHECK=2, the mid-results are
[q4,q5], and the heartbeat count is reached after [q8/delete], the [q8/put]
is wrongly added to the results because the matcher no longer has [q8/delete].
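To make the failure mode concrete, here is a minimal, self-contained sketch. It is not HBase code: ToyCell, scan and the resetOnResume flag are hypothetical names, the delete tracker is reduced to a set of qualifiers, and the row is a simplified one rather than the exact row from the test case. It only illustrates how clearing the tracked deletes at a heartbeat boundary lets a put behind a delete marker leak into the results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only; none of these types exist in HBase.
class MatcherResetSketch {

    // Hypothetical stand-in for a cell: a qualifier plus a delete-marker flag.
    static final class ToyCell {
        final String qualifier;
        final boolean isDelete;

        ToyCell(String qualifier, boolean isDelete) {
            this.qualifier = qualifier;
            this.isDelete = isDelete;
        }
    }

    // Scans the cells of ONE row. When resetOnResume is true, the tracked deletes
    // are cleared every cellsPerHeartbeat cells, mimicking the unwanted matcher
    // reset after an early heartbeat return.
    static List<String> scan(List<ToyCell> row, int cellsPerHeartbeat, boolean resetOnResume) {
        Set<String> deletes = new HashSet<>();
        List<String> results = new ArrayList<>();
        int scanned = 0;
        for (ToyCell cell : row) {
            scanned++;
            if (cell.isDelete) {
                deletes.add(cell.qualifier);      // remember the delete marker
            } else if (!deletes.contains(cell.qualifier)) {
                results.add(cell.qualifier);      // put is visible
            }
            if (resetOnResume && scanned % cellsPerHeartbeat == 0) {
                deletes.clear();                  // the problematic reset within the same row
            }
        }
        return results;
    }

    // Simplified row: a visible put, then a delete marker followed by the put it masks.
    static List<ToyCell> demoRow() {
        return Arrays.asList(
            new ToyCell("q4", false),
            new ToyCell("q8", true),
            new ToyCell("q8", false));
    }

    public static void main(String[] args) {
        System.out.println("with reset:    " + scan(demoRow(), 2, true));
        System.out.println("without reset: " + scan(demoRow(), 2, false));
    }
}
```

With a heartbeat interval of 2, the boundary falls right after the q8 delete marker, so the variant that resets loses the marker before it sees the q8 put.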
I think for this issue, we can use HRegion#checkInterrupt to avoid getting
stuck when closing the region, as in the code here:
{code:java}
// when reaching the heartbeat cells, try to return from the loop.
if (kvsScanned % cellsPerHeartbeatCheck == 0) {
- return scannerContext.setScannerState(NextState.MORE_VALUES).hasMoreValues();
+ this.store.getHRegion().checkInterrupt();
} {code}
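The idea behind this first option can be sketched in isolation as follows. This is an assumption-laden toy, not the HBase implementation: skipCells is a hypothetical method, and the thread's interrupt flag stands in for whatever HRegion#checkInterrupt actually inspects. It shows a loop that keeps skipping cells but polls for an interrupt every cellsPerHeartbeatCheck cells instead of returning early.

```java
import java.io.InterruptedIOException;

// Toy model only; skipCells is a hypothetical stand-in for the scanner loop.
class HeartbeatInterruptSketch {

    // Skips totalCells cells, checking the interrupt flag at every heartbeat
    // boundary. Returns the number of cells scanned when not interrupted.
    static long skipCells(long totalCells, long cellsPerHeartbeatCheck)
            throws InterruptedIOException {
        long kvsScanned = 0;
        while (kvsScanned < totalCells) {
            kvsScanned++;
            // Poll only every cellsPerHeartbeatCheck cells to keep the check cheap.
            if (kvsScanned % cellsPerHeartbeatCheck == 0
                    && Thread.currentThread().isInterrupted()) {
                throw new InterruptedIOException(
                    "interrupted after " + kvsScanned + " cells");
            }
        }
        return kvsScanned;
    }
}
```

Because the loop never returns mid-row in this variant, the matcher state is left alone; the trade-off is that a caller only regains control through the exception.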
But we can also fix the matcher issue, because I think the matcher reset
should only happen when the scanner reaches a new row. The early return on
reaching the heartbeat cell count lets user scanners return as soon as
possible when they encounter mass deletions. The fix is:
{code:java}
@@ -276,9 +278,12 @@ public abstract class ScanQueryMatcher implements ShipperListener {
* @param currentRow
*/
public void setToNewRow(Cell currentRow) {
- this.currentRow = currentRow;
- columns.reset();
- reset();
+ if (this.currentRow == null
+ || this.rowComparator.compareRows(currentRow, this.currentRow) != 0) {
+ this.currentRow = currentRow;
+ columns.reset();
+ reset();
+ }
} {code}
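The effect of this guard can be checked with a small stand-alone model. Again this is only a sketch under stated assumptions: GuardedMatcherSketch, addDelete and isDeleted are hypothetical names, rows are plain strings compared with equals rather than with a row comparator, and the delete tracker is a set of qualifiers.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of a matcher whose per-row state is reset only on a real row change.
class GuardedMatcherSketch {
    private String currentRow;                       // null until the first row is set
    private final Set<String> deletes = new HashSet<>();

    // Mirrors the guarded setToNewRow: reset only when the row actually changes.
    void setToNewRow(String row) {
        if (currentRow == null || !currentRow.equals(row)) {
            currentRow = row;
            deletes.clear();
        }
    }

    void addDelete(String qualifier) {
        deletes.add(qualifier);
    }

    boolean isDeleted(String qualifier) {
        return deletes.contains(qualifier);
    }

    public static void main(String[] args) {
        GuardedMatcherSketch matcher = new GuardedMatcherSketch();
        matcher.setToNewRow("row1");
        matcher.addDelete("q8");
        matcher.setToNewRow("row1");                 // resumed scan on the same row
        System.out.println("same row keeps delete: " + matcher.isDeleted("q8"));
        matcher.setToNewRow("row2");                 // genuine row change
        System.out.println("new row keeps delete:  " + matcher.isDeleted("q8"));
    }
}
```

Calling setToNewRow again for the same row is a no-op here, so a scan that resumes after a heartbeat return still sees the delete marker recorded earlier.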
I prefer the second solution. What do you think? [~vjasani] [~apurtell]
> Close region may stuck when region is compacting and skipped most cells read
> ----------------------------------------------------------------------------
>
> Key: HBASE-25709
> URL: https://issues.apache.org/jira/browse/HBASE-25709
> Project: HBase
> Issue Type: Bug
> Components: Compaction
> Affects Versions: 1.7.1, 3.0.0-alpha-2, 2.4.10
> Reporter: Xiaolin Ha
> Assignee: Xiaolin Ha
> Priority: Major
> Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.11
>
> Attachments: Master-UI-RIT.png, RS-region-state.png
>
>
> In our cluster we found a region stuck while closing. The region is
> compacting, and its store files have many TTL-expired cells. The close
> region state marker (HRegion#writestate.writesEnabled) is not checked during
> compaction, because most cells are skipped.
> !RS-region-state.png|width=698,height=310!
>
> !Master-UI-RIT.png|width=693,height=157!
>
> HBASE-23968 encountered a similar problem, but its solution sits outside
> the method
> InternalScanner#next(List<Cell> result, ScannerContext scannerContext), which
> will not return if there are many skipped cells, for the current compaction
> scanner context. As a result, we need to return in time from the next
> method, and then check the stop marker.
>
>
>