Bryan Beaudreault created HBASE-27593:
-----------------------------------------
Summary: Clear meta cache for full server when handling FailedServerException
Key: HBASE-27593
URL: https://issues.apache.org/jira/browse/HBASE-27593
Project: HBase
Issue Type: Improvement
Reporter: Bryan Beaudreault
Currently we prefer to clear the meta cache only for the individual region that
fails. This is the right choice in most cases, because clearing the cache for an
entire server is much more expensive. If a server hosts 100 regions,
unnecessarily clearing the cache for the entire server would cause 100 meta
requests per client.
However, when a client fails to connect to a regionserver, that server gets
added to the FailedServers list. Subsequent requests to that server are
fast-failed with a FailedServerException.
This is a pretty clear indicator that there's a problem with a specific server.
In this case I think we should clear the cache for that full server.
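
To illustrate the idea, here is a minimal sketch in Java. It is not the actual HBase client code: ToyMetaCache, clearRegion, clearServer, onFailure, and the local FailedServerException class are all hypothetical stand-ins, and the cache is assumed to be a simple map from region name to server name. It only shows the proposed decision: clear one region for an ordinary failure, clear the whole server when the failure is a FailedServerException.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the exception thrown when a request is fast-failed. */
class FailedServerException extends RuntimeException {
    FailedServerException(String message) {
        super(message);
    }
}

/** Toy client-side location cache (region name -> server name). Not the real HBase cache. */
class ToyMetaCache {
    private final Map<String, String> regionToServer = new HashMap<>();

    void put(String region, String server) {
        regionToServer.put(region, server);
    }

    int size() {
        return regionToServer.size();
    }

    /** Drops one cached region location; returns the number of entries removed (0 or 1). */
    int clearRegion(String region) {
        return regionToServer.remove(region) == null ? 0 : 1;
    }

    /** Drops every cached location that points at the given server; returns the number removed. */
    int clearServer(String server) {
        int before = regionToServer.size();
        regionToServer.values().removeIf(server::equals);
        return before - regionToServer.size();
    }

    /**
     * Proposed policy: a FailedServerException signals a server-wide problem, so clear
     * the full server; any other failure only invalidates the region that failed.
     */
    int onFailure(Throwable error, String region, String server) {
        if (error instanceof FailedServerException) {
            return clearServer(server);
        }
        return clearRegion(region);
    }
}

class MetaCacheSketch {
    public static void main(String[] args) {
        ToyMetaCache cache = new ToyMetaCache();
        for (int i = 0; i < 100; i++) {
            cache.put("region-" + i, "rs1");
        }
        // An ordinary failure invalidates a single region.
        System.out.println(cache.onFailure(new IOException("timeout"), "region-0", "rs1"));
        // A fast-failed request invalidates everything cached for that server at once.
        System.out.println(cache.onFailure(new FailedServerException("rs1"), "region-1", "rs1"));
    }
}
```

With per-region clearing alone, a client in this sketch would have to hit a failure on each of the remaining regions before its cache stopped pointing at the bad server. The server-wide clear removes them all on the first fast-fail.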
We had a production incident recently where a server completely hung. We did
see "Clear Region" calls, but the server hosted many regions, so the meta
clears continued for longer than necessary. Adding a "Clear Server" call
triggered by FailedServers would have mitigated this issue much more quickly.