[ https://issues.apache.org/jira/browse/HBASE-25212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219261#comment-17219261 ]

Andrew Kyle Purtell commented on HBASE-25212:
---------------------------------------------

After upgrading to Java 11, I realized I was just missing a change to HTU. Never mind.

> Optionally abort requests in progress after deciding a region should close
> --------------------------------------------------------------------------
>
>                 Key: HBASE-25212
>                 URL: https://issues.apache.org/jira/browse/HBASE-25212
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0
>
>
> After deciding a region should be closed, the regionserver will set the
> internal region state to closing and wait for all pending requests to
> complete, via a rendezvous on the region lock. In closing state the region
> will not accept any new requests but requests in progress will be allowed to
> complete before the close action takes place. In our production we see
> outlier wait times on this lock in excess of several minutes.
> During close, when there are requests in flight, the regionserver is subject
> to any conceivable reason for delay, like full scans over large regions,
> expensive filtering hierarchies, bugs, or store level performance problems
> like slow HDFS. The regionserver should interrupt requests in progress to
> facilitate shorter close times on an opt-in basis.
> Optionally, via configuration parameter -- which would be a system wide
> default set in hbase-site.xml in common practice but could be overridden in
> table schema for per table settings -- interrupt requests in progress holding
> the region lock rather than wait for completion of all operations in flight.
> Send back NotServingRegionException("region is closing") to the clients of
> the interrupted operations, like we do after the write lock is acquired. The
> client will transparently relocate the region data and resubmit the aborted
> requests per normal retry policy. This can be less disruptive than waiting
> for very long times for a region to close in extreme outlier cases (e.g. 50
> minutes). In such extreme cases it is better to abort the regionserver if the
> close lock cannot be acquired in a reasonable amount of time, because the
> region cannot be made available again until it has closed.
> After waiting for all requests to complete, we flush the region's
> memstore and finish the close. The flush portion of the close process is out
> of scope of this proposal. Under normal conditions the flush portion of the
> close completes quickly. It is specifically the waits on the close lock that
> have been an occasional issue in our production and that make it difficult
> to achieve 99.99% availability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
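
For illustration, the close behavior proposed in the quoted description can be sketched in plain Java. This is a minimal sketch under stated assumptions, not the HBase implementation: the class InterruptibleRegionSketch, the exception RegionClosingException (a stand-in for NotServingRegionException), and the parameters interruptInFlight, patienceMillis, and maxAttempts are all hypothetical names chosen for the example. Requests hold a shared read lock and register their thread; close takes the write lock, and when opted in it interrupts the registered threads instead of waiting for them indefinitely.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch only: none of these names are HBase APIs.
class InterruptibleRegionSketch {
    // Stand-in for HBase's NotServingRegionException.
    static class RegionClosingException extends Exception {
        RegionClosingException(String message) { super(message); }
    }

    // A unit of request work that can be interrupted, e.g. a long scan.
    interface Work { void run() throws InterruptedException; }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Set<Thread> inFlight = ConcurrentHashMap.newKeySet();
    private volatile boolean closing = false;

    void handleRequest(Work work) throws RegionClosingException {
        Thread self = Thread.currentThread();
        inFlight.add(self);          // register before locking so close() can find us
        lock.readLock().lock();      // requests share the region lock
        try {
            if (closing) throw new RegionClosingException("region is closing");
            work.run();
        } catch (InterruptedException e) {
            throw new RegionClosingException("region is closing");
        } finally {
            lock.readLock().unlock();
            inFlight.remove(self);
            Thread.interrupted();    // clear an interrupt that arrived too late to matter
        }
    }

    // Returns true once the exclusive lock was obtained, false if attempts ran out.
    boolean close(boolean interruptInFlight, long patienceMillis, int maxAttempts)
            throws InterruptedException {
        closing = true;              // requests that take the lock from now on are refused
        if (!interruptInFlight) {
            lock.writeLock().lock(); // prior behavior: wait however long it takes
            lock.writeLock().unlock();
            return true;
        }
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (lock.writeLock().tryLock(patienceMillis, TimeUnit.MILLISECONDS)) {
                // The memstore flush and final close would run here.
                lock.writeLock().unlock();
                return true;
            }
            for (Thread t : inFlight) t.interrupt(); // abort operations still holding the lock
        }
        return false;                // caller may escalate, e.g. abort the server
    }

    // Starts a 60 second request, closes with interruption enabled, reports the outcome.
    static String demo() {
        InterruptibleRegionSketch region = new InterruptibleRegionSketch();
        CountDownLatch started = new CountDownLatch(1);
        AtomicReference<String> seen = new AtomicReference<>("completed normally");
        Thread scanner = new Thread(() -> {
            try {
                region.handleRequest(() -> { started.countDown(); Thread.sleep(60_000); });
            } catch (RegionClosingException e) {
                seen.set(e.getMessage());
            }
        });
        try {
            scanner.start();
            started.await();         // the request now holds the read lock
            boolean closed = region.close(true, 50, 5);
            scanner.join();
            return "request saw: " + seen.get() + "; closed: " + closed;
        } catch (InterruptedException e) {
            return "demo interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the long-running request is interrupted after the first 50 ms wait and fails with "region is closing", so the close completes in well under a second rather than after 60 seconds. A client-side retry, as the description notes, would then resubmit the aborted request against the relocated region.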