Bryan Beaudreault created HBASE-27521:
-----------------------------------------

             Summary: CallTimeoutException can cause feedback loop with meta 
clears
                 Key: HBASE-27521
                 URL: https://issues.apache.org/jira/browse/HBASE-27521
             Project: HBase
          Issue Type: Improvement
            Reporter: Bryan Beaudreault


In HBASE-27487 and HBASE-27490 we added safeguards that should reduce the 
feedback loop caused by slow meta. We have continued to chaos test the HBase client and 
have found another case that needs to be handled.

With those two jiras, we no longer allow multiget to exceed operation timeout 
when meta is slow, and OperationTimeoutExceededExceptions do not clear meta 
cache. This allows for quicker recovery in many cases.

However, consider the case where you have a 1s RPC timeout and 3s operation 
timeout. Let's say meta is slow and it takes 2.9 seconds to resolve region 
locations for a batch. When we go to submit the multi actions to the server, we 
will have only 100ms remaining on our operation timeout. This may not be 
enough, and it results in a CallTimeoutException.
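
To make the arithmetic concrete, here is a minimal sketch of the time budget in that example. It is not HBase client code: the class and method names are hypothetical, and it assumes the deadline for a single call is the smaller of the RPC timeout and whatever is left of the operation timeout.

```java
import java.time.Duration;

/** Illustrative only: hypothetical names, not part of the HBase client. */
final class TimeoutBudgetSketch {

  /** Operation time left after some time has already been spent; never negative. */
  static Duration remaining(Duration operationTimeout, Duration elapsed) {
    Duration left = operationTimeout.minus(elapsed);
    return left.isNegative() ? Duration.ZERO : left;
  }

  /** Assumed rule: a single call gets min(rpcTimeout, remaining operation time). */
  static Duration effectiveCallTimeout(Duration rpcTimeout, Duration remainingOperationTime) {
    return rpcTimeout.compareTo(remainingOperationTime) <= 0
        ? rpcTimeout
        : remainingOperationTime;
  }

  public static void main(String[] args) {
    Duration rpcTimeout = Duration.ofSeconds(1);
    Duration operationTimeout = Duration.ofSeconds(3);
    Duration metaLookup = Duration.ofMillis(2900); // slow region location lookup

    Duration left = remaining(operationTimeout, metaLookup);
    Duration call = effectiveCallTimeout(rpcTimeout, left);

    // prints "remaining=100ms, call timeout=100ms"
    System.out.println(
        "remaining=" + left.toMillis() + "ms, call timeout=" + call.toMillis() + "ms");
  }
}
```

Under these assumptions the multi call gets a tenth of the configured RPC timeout, so a healthy server can still fail to answer in time.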

I use slow meta as an example, but any slow regionserver can kick off a 
feedback loop: a sudden surge in CallTimeoutExceptions causes many clients to 
clear their cache and hit meta. Even with meta replicas, this just exacerbates 
any slowness and may become unrecoverable without extreme actions.

I also use multigets as the example here, and I think they are most at risk of 
this, but this is theoretically possible for all request types. A single Get 
might retry a few times and the last attempt only has a few milliseconds of 
remaining time. This could also result in a CallTimeoutException and 
potentially kick off a feedback loop.

I'm still trying to consider options, and am open to opinions here. This issue 
affects AsyncTable and Table. Here are some raw options I've been weighing:
 * Treat CallTimeoutException (CTE) as special (non-clearing)
 ** Problem: what if a failed server continues to time out and we have no idea 
that regions have moved?
 ** Can we differentiate CTE vs some sort of SocketTimeoutException (where 
server is not serving)?
 * Rate limit cache clears from CTE
 * Rate limit all cache clears
 * Treat CTE as an OperationTimeoutExceededException when remainingTime < 
rpcTimeout, thus skipping clear in that case
 ** This is the most targeted solution, which may leave other edge cases.
 ** Problem: what about the case where a server is just running slow, but 
continually causing cache clears?
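
As a rough illustration of the last option, the check could look something like the sketch below. The class, enum, and method names are hypothetical and do not exist in the HBase client. It assumes "remainingTime" means the operation time that was left when the timed-out call was submitted.

```java
/** Illustrative only: hypothetical names, not part of the HBase client. */
final class TimeoutClassifierSketch {

  enum Action { CLEAR_META_CACHE, SKIP_CLEAR }

  /**
   * Decide what to do after a call times out.
   *
   * @param remainingOperationMs operation time left when the call was submitted (assumption)
   * @param rpcTimeoutMs configured RPC timeout
   */
  static Action onCallTimeout(long remainingOperationMs, long rpcTimeoutMs) {
    // The call never had a full RPC timeout to complete, so the timeout says
    // little about the server or region location: treat it like an operation
    // timeout and leave the cache alone.
    if (remainingOperationMs < rpcTimeoutMs) {
      return Action.SKIP_CLEAR;
    }
    // The call had the full RPC timeout and still failed: keep current behavior.
    return Action.CLEAR_META_CACHE;
  }

  public static void main(String[] args) {
    System.out.println(onCallTimeout(100, 1000));  // prints "SKIP_CLEAR"
    System.out.println(onCallTimeout(2000, 1000)); // prints "CLEAR_META_CACHE"
  }
}
```

This sketch does nothing for the slow-server problem noted above: a call that gets its full RPC timeout and still times out would clear the cache every time.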



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
