Bryan Beaudreault created HBASE-28128:
-----------------------------------------
Summary: Reject requests at RPC layer when RegionServer is aborting
Key: HBASE-28128
URL: https://issues.apache.org/jira/browse/HBASE-28128
Project: HBase
Issue Type: Improvement
Reporter: Bryan Beaudreault
We recently had an operational incident in which a RegionServer was aborted but
failed to exit within a reasonable timeframe. We're going to tune
hbase.regionserver.abort.timeout much lower than the 20m default, but even with
that change it makes little sense to keep accepting requests while the server
is aborting.
In our case, the server was impaired and not processing requests. The call
queue was full, so NettyRpcServer kept trying and failing to add requests to
the queue. Each failure resulted in a CallQueueTooBigException, which is not a
meta cache clearing exception, so clients had no reason to drop their cached
region locations and kept retrying against the same server. The server
continued throwing these exceptions for multiple minutes until we finally
killed it manually.
I'd like to add a check in ServerRpcConnection.processRequest that tests
regionServer.isAborted() and, if the server is aborting, throws a
RegionServerAbortedException rather than attempting to enqueue the request.
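
A minimal, self-contained sketch of the proposed ordering follows. It is not the
actual HBase code: Server, Connection, outcome and both exception classes here
are hypothetical stand-ins defined only for illustration, and the real
ServerRpcConnection.processRequest has a different signature. The only point it
shows is that the abort check runs before the enqueue attempt, so an aborting
server answers with an abort-specific exception instead of
CallQueueTooBigException.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

public class AbortAwareDispatchSketch {

    // Hypothetical stand-ins for the HBase exceptions named in this issue.
    static class RegionServerAbortedException extends Exception {
        RegionServerAbortedException(String msg) { super(msg); }
    }

    static class CallQueueTooBigException extends Exception {
        CallQueueTooBigException(String msg) { super(msg); }
    }

    // Hypothetical stand-in for the server; only the abort flag is modeled.
    static class Server {
        private final AtomicBoolean aborted = new AtomicBoolean(false);
        void abort() { aborted.set(true); }
        boolean isAborted() { return aborted.get(); }
    }

    // Hypothetical stand-in for the RPC connection with a bounded call queue.
    static class Connection {
        private final Server server;
        private final BlockingQueue<String> callQueue;

        Connection(Server server, int queueCapacity) {
            this.server = server;
            this.callQueue = new ArrayBlockingQueue<>(queueCapacity);
        }

        void processRequest(String call)
                throws RegionServerAbortedException, CallQueueTooBigException {
            // Proposed check: reject up front when the server is aborting.
            if (server.isAborted()) {
                throw new RegionServerAbortedException(
                    "Server is aborting, rejecting " + call);
            }
            // Existing behavior: a full queue rejects the call.
            if (!callQueue.offer(call)) {
                throw new CallQueueTooBigException(
                    "Call queue is full, rejecting " + call);
            }
        }
    }

    // Maps the result of one dispatch attempt to a short label.
    static String outcome(Connection conn, String call) {
        try {
            conn.processRequest(call);
            return "queued";
        } catch (RegionServerAbortedException e) {
            return "aborted";
        } catch (CallQueueTooBigException e) {
            return "queue-full";
        }
    }

    public static void main(String[] args) {
        Server server = new Server();
        Connection conn = new Connection(server, 1); // capacity of one call

        System.out.println(outcome(conn, "call-1")); // fits in the queue
        System.out.println(outcome(conn, "call-2")); // queue is now full
        server.abort();
        System.out.println(outcome(conn, "call-3")); // abort check wins
    }
}
```

In this sketch the queue is still full when the third call arrives, yet the
caller sees the abort-specific exception because the check runs first. Whether
clients treat that exception as a reason to re-locate regions depends on the
client's exception handling, which the sketch does not model.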