[ https://issues.apache.org/jira/browse/HBASE-25212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Kyle Purtell updated HBASE-25212:
----------------------------------------
Status: Patch Available (was: Open)
> Optionally abort requests in progress after deciding a region should close
> --------------------------------------------------------------------------
>
> Key: HBASE-25212
> URL: https://issues.apache.org/jira/browse/HBASE-25212
> Project: HBase
> Issue Type: Improvement
> Components: regionserver
> Reporter: Andrew Kyle Purtell
> Assignee: Andrew Kyle Purtell
> Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0
>
>
> After deciding a region should be closed, the regionserver will set the
> internal region state to closing and wait for all pending requests to
> complete, via a rendezvous on the region lock. In closing state the region
> will not accept any new requests but requests in progress will be allowed to
> complete before the close action takes place. In production we see outlier
> wait times on this lock of several minutes or more.
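
For illustration, the rendezvous described above can be sketched with a read/write lock: each request holds the read lock, and close sets a closing flag and then waits for the write lock. This is a minimal sketch, not HBase's actual HRegion code. The class and method names are hypothetical, and IllegalStateException stands in for NotServingRegionException.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a region; not HBase's HRegion. */
class RegionCloseSketch {
    private final ReentrantReadWriteLock closeLock = new ReentrantReadWriteLock();
    private final AtomicBoolean closing = new AtomicBoolean(false);

    /** Called at the start of each request; rejects new work once closing. */
    void startOperation() {
        if (closing.get()) {
            throw new IllegalStateException("region is closing");
        }
        closeLock.readLock().lock();
        // Re-check after acquiring the lock: close may have begun in between.
        if (closing.get()) {
            closeLock.readLock().unlock();
            throw new IllegalStateException("region is closing");
        }
    }

    /** Called when a request finishes; must run on the thread that started it. */
    void endOperation() {
        closeLock.readLock().unlock();
    }

    /** Marks the region closing, then blocks until in-flight requests finish. */
    void close() {
        closing.set(true);
        // Unbounded wait: this is where the outlier close times come from.
        closeLock.writeLock().lock();
        try {
            // Flush and finish the close (out of scope for this sketch).
        } finally {
            closeLock.writeLock().unlock();
        }
    }

    boolean isClosing() {
        return closing.get();
    }
}
```

In this sketch, a single slow request holding the read lock delays close() for as long as it runs, while every new request is already being rejected.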
> During close, when there are requests in flight, the regionserver is exposed
> to any conceivable source of delay, such as full scans over large regions,
> expensive filtering hierarchies, bugs, or store-level performance problems
> like slow HDFS. The regionserver should, on an opt-in basis, interrupt
> requests in progress to shorten close times.
> Optionally, via a configuration parameter (in common practice a system-wide
> default set in hbase-site.xml, which could be overridden in the table schema
> for per-table settings), interrupt requests in progress that hold the region
> lock rather than wait for all operations in flight to complete.
> Send back NotServingRegionException("region is closing") to the clients of
> the interrupted operations, as we do after the write lock is acquired. The
> client will transparently locate the region again and resubmit the aborted
> requests per normal retry policy. This can be less disruptive than waiting
> for very long times for a region to close in extreme outlier cases (e.g. 50
> minutes). In such extreme cases it is better to abort the regionserver if the
> close lock cannot be acquired in a reasonable amount of time, because the
> region cannot be made available again until it has closed.
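
A rough sketch of the proposed opt-in behavior follows, under stated assumptions: the configuration key, the class, and its methods are hypothetical and are not taken from the patch, and RegionClosingException stands in for NotServingRegionException. Close polls for the write lock at a fixed interval, interrupts the threads still holding the read lock when the option is enabled, and gives up after a maximum wait so the caller can escalate (for example by aborting the regionserver).

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch of an interrupting close; not the HBASE-25212 patch. */
class InterruptibleCloseSketch {
    /** Hypothetical key, not a confirmed HBase property name. */
    static final String ABORT_KEY = "example.region.close.abort.inflight";

    /** Stand-in for NotServingRegionException. */
    static class RegionClosingException extends RuntimeException {
        RegionClosingException(String msg) { super(msg); }
    }

    private final ReentrantReadWriteLock closeLock = new ReentrantReadWriteLock();
    // Threads currently holding the read lock (assumes non-reentrant operations).
    private final Set<Thread> holders = ConcurrentHashMap.newKeySet();
    private volatile boolean closing;

    /** A table-level setting, if present, overrides the site-wide default. */
    static boolean abortEnabled(Map<String, String> site, Map<String, String> table) {
        String v = table.containsKey(ABORT_KEY) ? table.get(ABORT_KEY) : site.get(ABORT_KEY);
        return Boolean.parseBoolean(v); // opt-in: an absent key means false
    }

    void startOperation() {
        if (closing) throw new RegionClosingException("region is closing");
        closeLock.readLock().lock();
        if (closing) { // close began while we were acquiring the lock
            closeLock.readLock().unlock();
            throw new RegionClosingException("region is closing");
        }
        holders.add(Thread.currentThread());
    }

    void endOperation() {
        holders.remove(Thread.currentThread());
        closeLock.readLock().unlock();
    }

    int inFlight() {
        return holders.size();
    }

    /** Returns true once closed; false if the lock was not acquired in time. */
    boolean close(boolean abortInFlight, long maxWaitMs, long intervalMs) {
        closing = true;
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMs);
        try {
            while (!closeLock.writeLock().tryLock(intervalMs, TimeUnit.MILLISECONDS)) {
                if (abortInFlight) {
                    // A holder may finish just before this runs; real code must
                    // tolerate a stray interrupt on a thread that already completed.
                    for (Thread t : holders) t.interrupt();
                }
                if (System.nanoTime() - deadline >= 0) {
                    return false; // caller decides what to do, e.g. abort the server
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            // Flush and finish the close (out of scope for this sketch).
            return true;
        } finally {
            closeLock.writeLock().unlock();
        }
    }
}
```

An interrupted operation would catch the interrupt, release the lock via endOperation(), and report the stand-in exception to its client, which then retries against the region's new location. Interrupts only help for operations that check their interrupt status or block in interruptible calls, which is one reason a maximum wait is still needed.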
> Once all requests have completed, we flush the region's memstore and finish
> the close. The flush portion of the close process is out of scope for this
> proposal. Under normal conditions the flush completes quickly. It is
> specifically the wait on the close lock that has been an occasional issue
> in our production, making it difficult to achieve 99.99% availability.