Bryan Beaudreault created HBASE-28128:
-----------------------------------------

             Summary: Reject requests at RPC layer when RegionServer is aborting
                 Key: HBASE-28128
                 URL: https://issues.apache.org/jira/browse/HBASE-28128
             Project: HBase
          Issue Type: Improvement
            Reporter: Bryan Beaudreault


We recently had an operational incident where a RegionServer was aborted but 
failed to exit within a reasonable timeframe. We're going to tune 
hbase.regionserver.abort.timeout much lower than the 20-minute default, but 
even with that it makes little sense to accept requests while the server is 
aborting.

In our case, the server was impaired and not processing requests. The call 
queue was full, so NettyRpcServer kept trying and failing to add requests to 
the queue. Each failure results in a CallQueueTooBigException, which is not a 
meta cache clearing exception, so clients kept their cached region locations 
and continued sending requests to the same server. It kept throwing these 
exceptions for multiple minutes until we finally killed the server manually.

I'd like to add a check in ServerRpcConnection.processRequest: if 
regionServer.isAborted() returns true, throw a RegionServerAbortedException 
rather than attempting to enqueue the request.
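
To make the proposal concrete, here is a rough, self-contained sketch of the 
ordering I have in mind. It is not the actual ServerRpcConnection code: 
RequestGate, Abortable, and the String-based call queue are hypothetical 
stand-ins, and the two exception classes are local placeholders named after 
the real HBase ones. The only point is that the abort check runs before any 
attempt to enqueue.

```java
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Local placeholder, named after the real HBase exception (assumption: not the real class).
class RegionServerAbortedException extends IOException {
  RegionServerAbortedException(String msg) { super(msg); }
}

// Local placeholder for the exception thrown today when the call queue is full.
class CallQueueTooBigException extends IOException {
  CallQueueTooBigException(String msg) { super(msg); }
}

// Hypothetical view of the server exposing only the abort flag.
interface Abortable {
  boolean isAborted();
}

// Hypothetical stand-in for the request path in ServerRpcConnection.processRequest.
class RequestGate {
  private final Abortable server;
  private final BlockingQueue<String> callQueue;

  RequestGate(Abortable server, int queueCapacity) {
    this.server = server;
    this.callQueue = new ArrayBlockingQueue<>(queueCapacity);
  }

  void processRequest(String request) throws IOException {
    // Proposed check: reject up front while aborting, before touching the queue.
    if (server.isAborted()) {
      throw new RegionServerAbortedException("Server is aborting, rejecting " + request);
    }
    // Existing behavior: a full queue surfaces as CallQueueTooBigException.
    if (!callQueue.offer(request)) {
      throw new CallQueueTooBigException("Call queue is full, rejecting " + request);
    }
  }
}

class AbortCheckSketch {
  public static void main(String[] args) {
    AtomicBoolean aborted = new AtomicBoolean(false);
    RequestGate gate = new RequestGate(aborted::get, 1);
    String[] requests = {"get-1", "get-2", "get-3"};
    for (String request : requests) {
      if (request.equals("get-3")) {
        aborted.set(true); // simulate the abort starting before the third request
      }
      try {
        gate.processRequest(request);
        System.out.println(request + " enqueued");
      } catch (IOException e) {
        System.out.println(request + " rejected: " + e.getClass().getSimpleName());
      }
    }
  }
}
```

In this sketch the second request fails with the queue-full exception, while 
the third is rejected with the aborted exception even though the queue is 
still full, which is the behavior change being proposed.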



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
