[jira] [Comment Edited] (HBASE-25709) Close region may stuck when region is compacting and skipped most cells read

Xiaolin Ha (Jira) Sat, 10 Apr 2021 23:27:07 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-25709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318654#comment-17318654
 ]


Xiaolin Ha edited comment on HBASE-25709 at 4/11/21, 6:26 AM:
--------------------------------------------------------------

Thanks, [~stack]. ^^ It's a very interesting problem, and I spent some time 
getting familiar with the scanner code.

The code shows that user scanners can break out of the loop once the 
time/batch limit is reached, even if many cells were SKIPped along the way. At 
the start of each iteration, the time limit is checked whenever the count of 
kvs scanned reaches a multiple of the heartbeat limit. So in this issue, user 
scanners can return before the timeout?
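
To make the heartbeat check concrete, here is a minimal sketch of a loop that 
only skips cells and consults the clock once every `cellsPerHeartbeatCheck` 
cells. It is a simplified model, not HBase's StoreScanner: the class 
`HeartbeatLoopSketch`, the method `skipCells`, the `Result` type and the 
injected tick clock are all hypothetical names chosen for illustration.

```java
import java.util.function.LongSupplier;

/** Simplified model of a scan loop that skips every cell; not HBase code. */
final class HeartbeatLoopSketch {
  enum State { TIME_LIMIT_REACHED, NO_MORE_VALUES }

  /** Outcome of one call: why the loop ended and how many cells it visited. */
  static final class Result {
    final State state;
    final long cellsScanned;

    Result(State state, long cellsScanned) {
      this.state = state;
      this.cellsScanned = cellsScanned;
    }
  }

  /**
   * Visits up to totalCells cells, all assumed to be skipped (for example
   * TTL expired), and reads the clock only at each heartbeat boundary.
   */
  static Result skipCells(long totalCells, long cellsPerHeartbeatCheck,
                          long deadline, LongSupplier clock) {
    long kvsScanned = 0;
    while (kvsScanned < totalCells) {
      kvsScanned++; // cell is skipped, so no batch/size limit ever advances
      if (kvsScanned % cellsPerHeartbeatCheck == 0
          && clock.getAsLong() >= deadline) {
        return new Result(State.TIME_LIMIT_REACHED, kvsScanned);
      }
    }
    return new Result(State.NO_MORE_VALUES, kvsScanned);
  }

  public static void main(String[] args) {
    // Fake clock that advances one tick per reading, so the run is deterministic.
    long[] now = {0};
    LongSupplier clock = () -> ++now[0];

    Result limited = skipCells(1_000_000, 1_000, 5, clock);
    System.out.println(limited.state + " after " + limited.cellsScanned + " cells");

    // With no effective deadline the loop only ends when the cells run out.
    Result unlimited = skipCells(1_000_000, 1_000, Long.MAX_VALUE, clock);
    System.out.println(unlimited.state + " after " + unlimited.cellsScanned + " cells");
  }
}
```

With a deadline the loop returns at a heartbeat boundary; with 
`Long.MAX_VALUE` as the deadline it runs through every cell, which is the 
situation described for compaction scanners.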

But compaction scanners don't set a time limit, so the loop might take a very 
long time to return, since it exits only after reading a batch limit's worth 
of cells. 

As a result, we could use `hbase.hstore.close.check.time.interval` as the 
time limit for compaction scanners, with no need to add the 
`preventLoopReadEnabled` variable to ScannerContext? But refreshing the time 
limit after every call to next() is not regular practice.  
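
A rough sketch of what that alternative could look like: the compaction 
driver re-arms a deadline before each next() call and checks the close marker 
between calls. This is an assumption-laden model, not the HBase compactor: 
`CompactionCloseCheckSketch`, `TimedScanner`, `compact` and the tick-based 
clock are hypothetical, and `closeCheckInterval` merely stands in for the 
value of `hbase.hstore.close.check.time.interval`.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.LongSupplier;

/** Simplified model of a compaction loop with a per-call deadline; not HBase code. */
final class CompactionCloseCheckSketch {
  /** Hypothetical scanner: returns by the deadline, false once exhausted. */
  interface TimedScanner {
    boolean next(long deadline);
  }

  /** Runs the loop and returns how many next() calls were made. */
  static int compact(TimedScanner scanner, AtomicBoolean writesEnabled,
                     long closeCheckInterval, LongSupplier clock) {
    int calls = 0;
    boolean hasMore = true;
    while (hasMore) {
      if (!writesEnabled.get()) {
        break; // region is closing: abandon the compaction
      }
      // Refresh the time limit before every call, as discussed above.
      long deadline = clock.getAsLong() + closeCheckInterval;
      hasMore = scanner.next(deadline);
      calls++;
    }
    return calls;
  }

  public static void main(String[] args) {
    AtomicBoolean writesEnabled = new AtomicBoolean(true);
    long[] now = {0};
    int[] seen = {0};
    // Scanner that never runs out and always uses its full time budget;
    // a close request arrives while the third call is in progress.
    TimedScanner scanner = deadline -> {
      now[0] = deadline;
      if (++seen[0] == 3) {
        writesEnabled.set(false);
      }
      return true;
    };
    int calls = compact(scanner, writesEnabled, 10_000, () -> now[0]);
    System.out.println("next() calls before abort: " + calls);
  }
}
```

Because each call is bounded by the interval, the close marker is seen at 
most one interval after it is set, even when every cell is skipped.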

So the `preventLoopReadEnabled` variable may just act as a flag for a timed 
return when scanners have no time limit.

If it is better to have preventLoopReadEnabled on by default, I can change it. ^^

 


was (Author: xiaolin ha):
Yes, I'll also turn this on for user scanners to avoid unexpectedly long 
queries. Thanks, [~stack]. I got it.

Seems that we can check the time limit for scanners after every heartbeat's 
worth of cells, just before the next iteration?  

The code is as follows:
{code:java}
...
  // when reaching the heartbeat check, try to return from the loop,
  // to avoid the cycle of skipped cells. See HBASE-25709
  if (kvsScanned % cellsPerHeartbeatCheck == 0) {
    if (scannerContext.checkTimeLimit(LimitScope.BETWEEN_CELLS)) {
      return scannerContext.setScannerState(NextState.TIME_LIMIT_REACHED).hasMoreValues();
    }
  }
} while ((cell = this.heap.peek()) != null);
{code}
As a result, we could use `hbase.hstore.close.check.time.interval` as the 
time limit for compaction scanners, with no need to add the 
`preventLoopReadEnabled` variable to ScannerContext? 

If it is better to have preventLoopReadEnabled on by default, I can change it. ^^

 

> Close region may stuck when region is compacting and skipped most cells read
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-25709
>                 URL: https://issues.apache.org/jira/browse/HBASE-25709
>             Project: HBase
>          Issue Type: Improvement
>          Components: Compaction
>    Affects Versions: 1.4.13
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>         Attachments: Master-UI-RIT.png, RS-region-state.png
>
>
> In our cluster we found a region close that got stuck. The region is 
> compacting, and its store files have many TTL-expired cells. The close-region 
> state marker (HRegion#writestate.writesEnabled) is not checked during 
> compaction, because most cells are skipped. 
> !RS-region-state.png|width=698,height=310!
>  
> !Master-UI-RIT.png|width=693,height=157!
>  
> HBASE-23968 encountered a similar problem, but its solution sits outside the 
> method
> InternalScanner#next(List<Cell> result, ScannerContext scannerContext), which 
> will not return if there are many skipped cells under the current compaction 
> scanner context. As a result, we need to return from the next method in time, 
> and then check the stop marker.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
