[
https://issues.apache.org/jira/browse/HBASE-27487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bryan Beaudreault updated HBASE-27487:
--------------------------------------
Fix Version/s: 2.6.0
> Slow meta can create pathological feedback loop with multigets
> --------------------------------------------------------------
>
> Key: HBASE-27487
> URL: https://issues.apache.org/jira/browse/HBASE-27487
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 2.5.1, 2.4.15
> Reporter: Bryan Beaudreault
> Assignee: Briana Augenreich
> Priority: Major
> Fix For: 2.6.0, 2.4.16, 2.5.3
>
>
> This only affects the Table implementation in 2.x releases.
> h4. Call stack
> When Table.batch is called, an AsyncProcessTask is created with
> SubmittedRows.ALL and sent to AsyncProcess.submit(). For the ALL case, this
> goes to submitAll, which creates an AsyncRequestFutureImpl and then calls
> groupAndSendMultiAction on it.
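> For reference, the client-side entry point for this path is an ordinary
> multiget through the Table API, for example something like the following
> (table name and row keys are placeholders):
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.TableName;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
> import org.apache.hadoop.hbase.client.Get;
> import org.apache.hadoop.hbase.client.Table;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class MultiGetExample {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     try (Connection conn = ConnectionFactory.createConnection(conf);
>          Table table = conn.getTable(TableName.valueOf("example_table"))) {
>       List<Get> gets = new ArrayList<>();
>       for (int i = 0; i < 1000; i++) {
>         gets.add(new Get(Bytes.toBytes("row-" + i)));
>       }
>       Object[] results = new Object[gets.size()];
>       // Internally this builds an AsyncProcessTask with SubmittedRows.ALL and
>       // hands it to AsyncProcess.submit(), per the call stack described above.
>       table.batch(gets, results);
>     }
>   }
> }
> {code}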
> When an AsyncRequestFutureImpl is created, a RetryingTimeTracker is created
> and started as the last step of the constructor.
> In groupAndSendMultiAction, the first thing that has to happen is resolving
> the HRegionLocation for every action in the batch. This is currently done
> sequentially, with no timeout applied to the overall batch completion.
> Once all actions have been resolved, they are passed into sendMultiAction,
> which creates a SingleServerRequestRunnable. When that runnable is executed,
> the first thing it does is create a new MultiServerCallable using the same
> RetryingTimeTracker that was started back in the constructor.
> That callable extends CancellableRegionServerCallable, and its call method
> first checks tracker.getRemainingTime() before doing any actual work. If the
> remaining time has been exhausted, it throws an exception.
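> To make the shape of that sequence concrete, here is a toy model of the flow
> (not HBase code; the class and method names below are simplified stand-ins
> for RetryingTimeTracker, the location-resolution loop, and the check in
> CancellableRegionServerCallable.call()):
> {code:java}
> import java.util.List;
> import java.util.concurrent.TimeUnit;
> import java.util.stream.Collectors;
> import java.util.stream.IntStream;
>
> // Toy model only -- stand-in names, not the real HBase internals.
> public class TrackerModel {
>
>   /** Stand-in for RetryingTimeTracker: started once, consulted much later. */
>   static class TimeTracker {
>     private final long startNs = System.nanoTime();
>     private final long operationTimeoutMs;
>
>     TimeTracker(long operationTimeoutMs) {
>       this.operationTimeoutMs = operationTimeoutMs;
>     }
>
>     long getRemainingTimeMs() {
>       long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs);
>       return operationTimeoutMs - elapsedMs;
>     }
>   }
>
>   static void submitBatch(List<String> rows, long operationTimeoutMs)
>       throws Exception {
>     // The tracker starts ticking at construction time, like the tracker
>     // created in the AsyncRequestFutureImpl constructor.
>     TimeTracker tracker = new TimeTracker(operationTimeoutMs);
>
>     // Sequential "location resolution": nothing in this loop consults the
>     // tracker, so a slow meta (simulated by sleep) can eat the whole budget.
>     for (String row : rows) {
>       Thread.sleep(5); // pretend each uncached lookup takes 5ms
>     }
>
>     // Only now, in the stand-in for CancellableRegionServerCallable.call(),
>     // is the remaining time checked -- and it may already be negative.
>     if (tracker.getRemainingTimeMs() <= 0) {
>       throw new RuntimeException(
>           "operation timeout exceeded before any multi RPC was sent");
>     }
>     // ... the actual multi RPC would happen here ...
>   }
>
>   public static void main(String[] args) throws Exception {
>     List<String> rows = IntStream.range(0, 100)
>         .mapToObj(i -> "row-" + i).collect(Collectors.toList());
>     // 100 rows * ~5ms of resolution blows through a 200ms operation timeout.
>     submitBatch(rows, 200);
>   }
> }
> {code}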
> h4. Problem
> If meta is overloaded, or you send a sufficiently large batch of actions,
> resolving the HRegionLocations (which happens sequentially) may take a while.
> Depending on the operation timeout configured for the client, that delay
> alone may exceed the timeout before execution ever reaches
> CancellableRegionServerCallable.call().
> When the timeout is exceeded there, a DoNotRetryIOException is thrown. This
> is considered a cache-clearing exception, so any locations that were slowly
> resolved earlier in the chain are thrown away. With enough concurrent
> requests, this can create a feedback loop that is effectively impossible to
> recover from: the cleared locations force every subsequent batch back to the
> already overloaded meta.
> h4. Potential Solutions
> # Change the thrown exception type from DoNotRetryIOException to something
> more appropriate for the actual error (some sort of timeout exception). We'd
> have to make that exception a "special" exception in ClientExceptionsUtil so
> that it doesn't clear the cache.
> # Make DoNotRetryIOException itself a "special" exception. The point of
> clearing the cache is to make retries more likely to succeed when the
> failure was related to a wrong location. DoNotRetryIOException explicitly is
> not supposed to be retried, so arguably it shouldn't clear the cache either.
> There are many usages of this exception, though, so it's hard to say for
> sure that this would be universally safe.
> # Reset the RetryingTimeTracker after resolving region locations.
> I think I'd lean towards option 1, because it seems wrong to say "don't
> retry" in this case. If anything, a retry should be more likely to succeed,
> because the locations will already have been resolved by then.
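> As a sketch only (the class name is hypothetical, and the exact wiring into
> ClientExceptionsUtil's "special" exception handling still needs to be worked
> out), option 1 could look something like this:
> {code:java}
> import org.apache.hadoop.hbase.DoNotRetryIOException;
>
> // Hypothetical exception type for option 1. Subclassing DoNotRetryIOException
> // keeps the existing "don't retry this attempt" semantics, while the distinct
> // type would let ClientExceptionsUtil treat it as "special" so that hitting
> // the operation timeout no longer clears cached region locations.
> public class OperationTimeoutExceededException extends DoNotRetryIOException {
>   public OperationTimeoutExceededException(String message) {
>     super(message);
>   }
> }
> {code}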
> Whichever option we choose, I think we should additionally check the timeout
> in groupAndSendMultiAction after resolving each region location. That
> process should not be allowed to exceed the operation timeout, but today it
> can blow far past it before the timeout is finally checked, almost
> incidentally, at the end.
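> In terms of the toy model above (stand-in names again, not the actual
> AsyncProcess code), that extra check would amount to something like:
> {code:java}
> // Consult the tracker while resolving locations, instead of only after the
> // whole batch has been grouped and handed to the per-server callables.
> for (String row : rows) {
>   if (tracker.getRemainingTimeMs() <= 0) {
>     throw new RuntimeException(
>         "operation timeout exceeded while resolving region locations");
>   }
>   Thread.sleep(5); // stand-in for an uncached location lookup against meta
> }
> {code}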
--
This message was sent by Atlassian Jira
(v8.20.10#820010)