Bryan Beaudreault created HBASE-27521:
-----------------------------------------
Summary: CallTimeoutException can cause feedback loop with meta clears
Key: HBASE-27521
URL: https://issues.apache.org/jira/browse/HBASE-27521
Project: HBase
Issue Type: Improvement
Reporter: Bryan Beaudreault
In HBASE-27487 and HBASE-27490 we added safeguards which should reduce the
feedback loop caused by slow meta. We have continued to chaos test the hbase
client and have found another case that needs to be handled.
With those two jiras, we no longer allow a multiget to exceed the operation
timeout when meta is slow, and OperationTimeoutExceededExceptions do not clear
the meta cache. This allows for quicker recovery in many cases.
However, consider the case where you have a 1s RPC timeout and 3s operation
timeout. Let's say meta is slow and it takes 2.9 seconds to resolve region
locations for a batch. When we go to submit the multi actions to the server,
only 100ms of the operation timeout remains. This may not be enough, and it
results in a CallTimeoutException, which in turn clears the meta cache.
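The arithmetic in this example can be sketched as follows. This is only an illustration, assuming each call is capped by the smaller of the RPC timeout and the time left on the operation timeout; TimeoutBudget and its methods are hypothetical names, not part of the HBase client API.

```java
// Hypothetical helper for illustration only; not an HBase client class.
final class TimeoutBudget {

  /** Time left on the operation timeout, never negative. */
  static long remainingMs(long operationTimeoutMs, long elapsedMs) {
    return Math.max(0L, operationTimeoutMs - elapsedMs);
  }

  /**
   * Assumption: a single call may use at most the RPC timeout, or whatever
   * is left of the operation timeout, whichever is smaller.
   */
  static long effectiveCallTimeoutMs(long rpcTimeoutMs, long operationTimeoutMs,
      long elapsedMs) {
    return Math.min(rpcTimeoutMs, remainingMs(operationTimeoutMs, elapsedMs));
  }

  public static void main(String[] args) {
    // 1s RPC timeout, 3s operation timeout, 2.9s spent resolving locations.
    System.out.println(effectiveCallTimeoutMs(1000, 3000, 2900)); // prints 100
  }
}
```

With a fast meta lookup (say 200ms elapsed) the same call would get the full 1000ms, so the timeout here is a product of the spent budget rather than of the target server.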
I use slow meta as an example, but any slow regionserver can kick off a
feedback loop: a sudden surge in CallTimeoutExceptions causes many clients to
clear their cache and hit meta. Even with meta replicas, this just exacerbates
any slowness and may become unrecoverable without extreme actions.
I also use multigets as the example here, and I think they are most at risk of
this, but this is theoretically possible for all request types. A single Get
might retry a few times and the last attempt only has a few milliseconds of
remaining time. This could also result in a CallTimeoutException and
potentially kick off a feedback loop.
I'm still trying to consider options, and am open to opinions here. This issue
affects AsyncTable and Table. Here are some raw options I've been weighing:
* Treat CallTimeoutException (CTE) as special (non-clearing)
  * Problem: what if a failed server continues to time out and we have no idea
    that regions have moved?
  * Can we differentiate CTE vs some sort of SocketTimeoutException (where
    server is not serving)?
* Rate limit cache clears from CTE
* Rate limit all cache clears
* Treat CTE as an OperationTimeoutExceededException when remainingTime <
  rpcTimeout, thus skipping clear in that case
  * This is the most targeted solution, which may leave other edge cases.
  * Problem: what about the case where a server is just running slow, but
    continually causing cache clears?
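To make two of the options above concrete, here is a rough sketch of the remainingTime < rpcTimeout check and of a simple rate limit on cache clears. All class and method names are hypothetical and none of this is existing HBase client code; timestamps are passed in by the caller to keep the sketch self-contained.

```java
// Hypothetical sketch of the "most targeted" option; not HBase client code.
final class CallTimeoutClearPolicy {

  /**
   * If the call had less than a full RPC timeout to work with, treat the
   * timeout like an OperationTimeoutExceededException and skip the clear.
   */
  static boolean shouldClearOnCallTimeout(long remainingMs, long rpcTimeoutMs) {
    return remainingMs >= rpcTimeoutMs;
  }
}

// Hypothetical sketch of the rate limiting options: at most one cache clear
// per minIntervalMs. The interval value is an assumption left to the caller.
final class CacheClearRateLimiter {
  private final long minIntervalMs;
  private long lastClearMs;
  private boolean clearedBefore;

  CacheClearRateLimiter(long minIntervalMs) {
    this.minIntervalMs = minIntervalMs;
  }

  /** Returns true if a clear is allowed at nowMs, and records it if so. */
  synchronized boolean tryClear(long nowMs) {
    if (clearedBefore && nowMs - lastClearMs < minIntervalMs) {
      return false; // too soon after the previous clear, keep the cache
    }
    clearedBefore = true;
    lastClearMs = nowMs;
    return true;
  }
}
```

Neither sketch addresses the open problems listed above: the first still clears when a slow server times out calls that had a full RPC timeout available, and the second delays discovery of regions that really have moved.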