[
https://issues.apache.org/jira/browse/HBASE-25709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318654#comment-17318654
]
Xiaolin Ha edited comment on HBASE-25709 at 4/11/21, 6:26 AM:
--------------------------------------------------------------
Thanks, [~stack]. ^^ It's a very interesting problem, and I spent some time
getting familiar with the scanner code.
The code shows that a user scanner can break out of the loop once it reaches
the time/batch limit, even if it has SKIPped many cells along the way. At the
beginning of the loop, it checks the time limit whenever the count of kvs
scanned reaches a multiple of the heartbeat check limit. So for this issue,
user scanners can return before the timeout?
But compaction scanners don't set a time limit, so it might take a very long
time to return from the loop, because that only happens once a batch limit's
worth of cells has been read.
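To make the difference concrete, here is a minimal sketch of such a loop. It is a simplified model and not HBase's StoreScanner: the class `SkipLoopSketch`, its `State` enum, the use of bare timestamps as cells, and the "negative deadline means no time limit" convention are all assumptions made for illustration.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.LongSupplier;

// Hypothetical, simplified model of a scanner loop; NOT HBase's StoreScanner.
final class SkipLoopSketch {
    enum State { BATCH_LIMIT_REACHED, TIME_LIMIT_REACHED, NO_MORE_VALUES }

    /**
     * Cells are modelled as timestamps; a cell older than oldestLiveTs counts
     * as expired and is skipped. A negative deadline means "no time limit"
     * (an assumption of this sketch, standing in for a compaction scanner).
     */
    static State next(Iterator<Long> cells, long oldestLiveTs, List<Long> out,
                      int batchLimit, long cellsPerHeartbeatCheck,
                      long deadline, LongSupplier clock) {
        long scanned = 0;
        while (cells.hasNext()) {
            long ts = cells.next();
            scanned++;
            if (ts >= oldestLiveTs) {
                out.add(ts);
                if (out.size() >= batchLimit) {
                    return State.BATCH_LIMIT_REACHED;
                }
            }
            // Skipped cells never count towards the batch limit, so only the
            // periodic time check below can end a long run of skips.
            if (deadline >= 0 && scanned % cellsPerHeartbeatCheck == 0
                    && clock.getAsLong() >= deadline) {
                return State.TIME_LIMIT_REACHED;
            }
        }
        return State.NO_MORE_VALUES;
    }
}
```

With a deadline set, a run of expired cells ends at the first heartbeat check past the deadline. With no deadline, the same call only returns after it has walked every skipped cell, which is the compaction case described above.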
As a result, we can use `hbase.hstore.close.check.time.interval` as the
time limit for compaction scanners, with no need to add the
`preventLoopReadEnabled` variable to ScannerContext? But refreshing the time
limit after every call to next() is not a regular practice.
So the variable `preventLoopReadEnabled` may just be a flag for a timed return
when the scanner has no time limit.
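The first option (reusing the close-check interval as a per-call time limit) could look roughly like the sketch below. Everything here is hypothetical: `CompactionLoopSketch`, the `TimedScanner` interface, and the `writesEnabled` supplier are stand-ins for illustration, not HBase APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

// Hypothetical model of a compaction driver; names are illustrative, not HBase APIs.
final class CompactionLoopSketch {
    /** Stand-in for a scanner whose next() returns once the deadline has passed. */
    interface TimedScanner {
        boolean next(List<String> out, long deadlineMillis);
    }

    /** Returns true if compaction ran to completion, false if a close request aborted it. */
    static boolean compact(TimedScanner scanner, long closeCheckIntervalMs,
                           BooleanSupplier writesEnabled, LongSupplier clock) {
        List<String> batch = new ArrayList<>();
        boolean more;
        do {
            // Re-arm the limit before every call: the "refresh after every next()" step.
            long deadline = clock.getAsLong() + closeCheckIntervalMs;
            more = scanner.next(batch, deadline);
            batch.clear(); // a real compaction would write the batch out here
            // Because next() returns in bounded time, the stop marker is seen promptly.
            if (!writesEnabled.getAsBoolean()) {
                return false;
            }
        } while (more);
        return true;
    }
}
```

The deadline has to be recomputed on each iteration, which is the part that is not common practice elsewhere in the scanner code.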
If it is better to turn preventLoopReadEnabled on by default, I can change it. ^^
was (Author: xiaolin ha):
Yes, I'll also turn this on for user scanners to avoid unexpectedly long queries.
Thanks, [~stack]. I got it.
It seems that we can check the time limit for scanners after each heartbeat's
worth of cells, just before the next cycle?
The code is as follows,
{code:java}
...
    // when reaching the heartbeat check, try to return from the loop,
    // to avoid the cycle of skipped cells. See HBASE-25709
    if (kvsScanned % cellsPerHeartbeatCheck == 0) {
      if (scannerContext.checkTimeLimit(LimitScope.BETWEEN_CELLS)) {
        return scannerContext.setScannerState(NextState.TIME_LIMIT_REACHED).hasMoreValues();
      }
    }
} while ((cell = this.heap.peek()) != null);
{code}
As a result, we can use `hbase.hstore.close.check.time.interval` as the
time limit for compaction scanners, with no need to add the
`preventLoopReadEnabled` variable to ScannerContext?
If it is better to turn preventLoopReadEnabled on by default, I can change it. ^^
> Close region may stuck when region is compacting and skipped most cells read
> ----------------------------------------------------------------------------
>
> Key: HBASE-25709
> URL: https://issues.apache.org/jira/browse/HBASE-25709
> Project: HBase
> Issue Type: Improvement
> Components: Compaction
> Affects Versions: 1.4.13
> Reporter: Xiaolin Ha
> Assignee: Xiaolin Ha
> Priority: Major
> Attachments: Master-UI-RIT.png, RS-region-state.png
>
>
> In our cluster we found that closing a region can get stuck. The region is
> compacting, and its store files have many TTL-expired cells. The close-region
> state marker (HRegion#writestate.writesEnabled) is not checked during
> compaction, because most cells are skipped.
> !RS-region-state.png|width=698,height=310!
>
> !Master-UI-RIT.png|width=693,height=157!
>
> HBASE-23968 encountered a similar problem, but its solution sits outside
> the method
> InternalScanner#next(List<Cell> result, ScannerContext scannerContext), which
> will not return if there are many skipped cells, given the current compaction
> scanner context. As a result, we need to return in time from the next method,
> and then check the stop marker.
>
>
>