Andrew Kyle Purtell created HBASE-25212:
-------------------------------------------
Summary: Optionally abort requests in progress after deciding a
region should close
Key: HBASE-25212
URL: https://issues.apache.org/jira/browse/HBASE-25212
Project: HBase
Issue Type: Improvement
Components: regionserver
Reporter: Andrew Kyle Purtell
Assignee: Andrew Kyle Purtell
Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0
After deciding a region should be closed, the regionserver will set the
internal region state to closing and wait for all pending requests to complete,
via a rendezvous on the region lock. In closing state the region will not
accept any new requests but requests in progress will be allowed to complete
before the close action takes place. In our production environment we see
outlier wait times on this lock in excess of several minutes.
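
For illustration only, the rendezvous described above can be sketched with a
read-write lock: requests hold the shared side, and close sets a closing flag
and then blocks on the exclusive side until every in-flight request has
released it. SketchRegion and its methods are hypothetical stand-ins, not the
actual HRegion code.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a region; names are illustrative, not HBase's.
class SketchRegion {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private volatile boolean closing = false;

    /** Runs one request while holding the shared (read) side of the lock. */
    void handle(Runnable request) {
        lock.readLock().lock();
        try {
            // Checked under the lock so a request cannot slip in after close.
            if (closing) {
                throw new IllegalStateException("region is closing");
            }
            request.run();
        } finally {
            lock.readLock().unlock();
        }
    }

    /**
     * Marks the region closing, then waits for all in-flight requests by
     * acquiring the exclusive (write) side. Returns milliseconds spent waiting.
     */
    long close() {
        closing = true;
        long start = System.nanoTime();
        lock.writeLock().lock(); // blocks for as long as any request runs
        try {
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

In this sketch the time close() spends blocked is bounded only by the slowest
request in flight, which is the outlier wait described above.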
During close when there are requests in flight the regionserver is subject to
any conceivable reason for delay, like full scans over large regions, expensive
filtering hierarchies, bugs, or store level performance problems like slow
HDFS. On an opt-in basis, the regionserver should interrupt requests in
progress to shorten close times.
Optionally, via a configuration parameter -- which in common practice would be
a system wide default set in hbase-site.xml but could be overridden in table
schema for per table settings -- interrupt requests in progress holding the
region lock rather than wait for completion of all operations in flight. Send
back NotServingRegionException("region is closing") to the clients of the
interrupted operations, like we do after the write lock is acquired. The client
will transparently relocate the region data and resubmit the aborted requests
per normal retry policy. This can be less disruptive than waiting for very long
times for a region to close in extreme outlier cases (e.g. 50 minutes).
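
A minimal sketch of the opt-in behavior proposed above, assuming the region
tracks the threads of in-flight requests so that close can interrupt them.
InterruptibleRegionSketch, the abortOnClose flag, the Request interface, and
NotServingException (a stand-in for NotServingRegionException) are all
hypothetical names for illustration; the actual patch may be structured
differently.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch; not the actual HRegion implementation.
class InterruptibleRegionSketch {
    /** Stand-in for NotServingRegionException. */
    static class NotServingException extends RuntimeException {
        NotServingException(String msg) { super(msg); }
    }

    /** A request body that can observe interruption. */
    interface Request { void run() throws InterruptedException; }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Set<Thread> inFlight = ConcurrentHashMap.newKeySet();
    private final boolean abortOnClose; // hypothetical opt-in setting
    private volatile boolean closing = false;

    InterruptibleRegionSketch(boolean abortOnClose) {
        this.abortOnClose = abortOnClose;
    }

    void handle(Request request) {
        lock.readLock().lock();
        inFlight.add(Thread.currentThread());
        try {
            if (closing) {
                throw new NotServingException("region is closing");
            }
            request.run();
        } catch (InterruptedException e) {
            // Simplification: any interrupt is treated as a close-time abort.
            throw new NotServingException("region is closing");
        } finally {
            // A real implementation would also need to handle an interrupt
            // that arrives after the request body has already finished.
            inFlight.remove(Thread.currentThread());
            lock.readLock().unlock();
        }
    }

    void close(long pollMillis) throws InterruptedException {
        closing = true;
        if (!abortOnClose) {
            lock.writeLock().lock(); // default: wait for requests to drain
        } else {
            // Give requests one poll interval, then interrupt the holders
            // and retry until the exclusive lock is acquired.
            while (!lock.writeLock().tryLock(pollMillis, TimeUnit.MILLISECONDS)) {
                for (Thread t : inFlight) {
                    t.interrupt();
                }
            }
        }
        lock.writeLock().unlock(); // flush and final close would happen before this
    }
}
```

Only requests that reach an interruptible point (blocking I/O, sleeps, or an
explicit interrupt check) stop early in this sketch; a request spinning in
uninterruptible code would still delay the close.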
After all requests have completed, we flush the region's memstore and finish
the close. The flush portion of the close process is out of scope of this
proposal. Under normal conditions the flush portion of the close completes
quickly. It is specifically the waits on the close lock that have been an
occasional issue in our production environment, making it difficult to achieve
99.99% availability.