[ 
https://issues.apache.org/jira/browse/HBASE-27521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-27521.
---------------------------------------
    Resolution: Not A Problem

I was only able to trigger this behavior under extreme load during tests. I 
tried a few different fixes and none of them materially affected the behavior. 
This leads me to believe that the bigger issue in this case is just the extreme 
load itself rather than the edge case of CallTimeoutException I describe.

I'm going to close this for now, but may reopen it if we stumble across a need 
to fix it based on a real-world scenario.

> CallTimeoutException can cause feedback loop with meta clears
> -------------------------------------------------------------
>
>                 Key: HBASE-27521
>                 URL: https://issues.apache.org/jira/browse/HBASE-27521
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Bryan Beaudreault
>            Priority: Major
>
> In HBASE-27487 and HBASE-27490 we added safeguards that should reduce the 
> feedback loop caused by slow meta. We have continued to chaos test the hbase 
> client and have found another case that needs to be handled.
> With those two jiras, we no longer allow multiget to exceed operation timeout 
> when meta is slow, and OperationTimeoutExceededExceptions do not clear meta 
> cache. This allows for quicker recovery in many cases.
> However, consider the case where you have a 1s RPC timeout and 3s operation 
> timeout. Let's say meta is slow and it takes 2.9 seconds to resolve region 
> locations for a batch. When we go to submit the multi actions to the server, 
> we will have only 100ms remaining on our operation timeout. This may not be 
> enough, and it results in a CallTimeoutException.
> I use slow meta as an example, but it's possible for any slow regionserver to 
> kick off a feedback loop due to a sudden surge in CallTimeoutException 
> resulting in many clients clearing cache and hitting meta. Even with meta 
> replicas, this just exacerbates any slowness and may become unrecoverable 
> without extreme actions.
> I also use multigets as the example here, and I think they are most at risk 
> of this, but this is theoretically possible for all request types. A single 
> Get might retry a few times and the last attempt only has a few milliseconds 
> of remaining time. This could also result in a CallTimeoutException and 
> potentially kick off a feedback loop.
> I'm still trying to consider options, and am open to opinions here. This 
> issue affects AsyncTable and Table. Here are some raw options I've been 
> weighing:
>  * Treat CTE as special (non-clearing)
>  ** Problem: what if a failed server continues to time out and we have no idea 
> that regions have moved?
>  ** Can we differentiate CTE vs some sort of SocketTimeoutException (where 
> server is not serving)?
>  * Rate limit cache clears from CTE
>  * Rate limit all cache clears
>  * Rethrow CTE as an OperationTimeoutExceededException when remainingTime < 
> rpcTimeout, thus skipping clear in that case
>  ** This is the most targeted solution, but may leave other edge cases.
>  ** Problem: what about the case where a server is just running slow, but 
> continually causing cache clears?
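
To make the timing in the description concrete, here is a minimal sketch in 
Java of the 1s/3s/2.9s example and of the check behind the last option 
(treating a call timeout as an operation timeout when remainingTime < 
rpcTimeout). The class and method names are hypothetical and are not part of 
the HBase client API; this only illustrates the arithmetic and is not a 
proposed patch.

```java
/** Hypothetical helper for illustration only; not an HBase client class. */
public final class TimeoutBudget {
    private TimeoutBudget() {}

    /** Time left on the operation timeout after elapsedMs has been spent. */
    public static long remainingMs(long operationTimeoutMs, long elapsedMs) {
        return Math.max(0L, operationTimeoutMs - elapsedMs);
    }

    /** A single call can never be given more time than the operation has left. */
    public static long effectiveCallTimeoutMs(long rpcTimeoutMs, long remainingMs) {
        return Math.min(rpcTimeoutMs, remainingMs);
    }

    /**
     * True when the call was given less than a full RPC timeout, so a timeout
     * says more about the exhausted operation budget than about the server.
     * In that case the sketch would skip the meta cache clear.
     */
    public static boolean treatAsOperationTimeout(long rpcTimeoutMs, long remainingMs) {
        return remainingMs < rpcTimeoutMs;
    }

    public static void main(String[] args) {
        long rpcTimeoutMs = 1_000;       // 1s RPC timeout from the example
        long operationTimeoutMs = 3_000; // 3s operation timeout from the example
        long metaLookupMs = 2_900;       // slow meta: 2.9s to resolve locations

        long remaining = remainingMs(operationTimeoutMs, metaLookupMs);
        long callTimeout = effectiveCallTimeoutMs(rpcTimeoutMs, remaining);

        System.out.println("remaining=" + remaining + "ms");     // remaining=100ms
        System.out.println("callTimeout=" + callTimeout + "ms"); // callTimeout=100ms
        System.out.println("skipCacheClear="
            + treatAsOperationTimeout(rpcTimeoutMs, remaining)); // skipCacheClear=true
    }
}
```

As the description notes, this check is narrow: a server that is merely slow 
while the full RPC timeout is still available would still trigger a cache 
clear.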



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
