[
https://issues.apache.org/jira/browse/HBASE-28358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17869064#comment-17869064
]
Duo Zhang commented on HBASE-28358:
-----------------------------------
Ping [~bbeaudreault].
There is also a related PR
https://github.com/apache/hbase/pull/6000
I think it is reasonable, but since it is a behavior change, I think we need to
discuss it more.
Thanks.
> AsyncProcess inconsistent exception thrown for operation timeout
> ----------------------------------------------------------------
>
> Key: HBASE-28358
> URL: https://issues.apache.org/jira/browse/HBASE-28358
> Project: HBase
> Issue Type: Bug
> Reporter: Bryan Beaudreault
> Priority: Major
>
> I'm not sure if I'll get to this, but wanted to log it as a known issue.
> AsyncProcess has a design where it breaks the batch into sub-batches based on
> regionserver, then submits a callable per regionserver in a threadpool. In
> the main thread, it calls waitUntilDone() with an operation timeout. If the
> callables don't finish within the operation timeout, a SocketTimeoutException
> is thrown. This exception is not very useful because it doesn't give you any
> sense of how many calls were in progress, on which servers, or why they were
> delayed.
> Recently we've been improving the adherence to operation timeout within the
> callables themselves. The main driver here has been to ensure we don't
> erroneously clear the meta cache for operation timeout related errors. So
> we've added a new OperationTimeoutExceededException, which is thrown from
> within the callables and does not cause a meta cache clear. The added benefit
> is that if these bubble up to the caller, they are wrapped in
> RetriesExhaustedWithDetailsException which includes a lot more info about
> which server and which action is affected.
> Now we've covered most but not all cases where operation timeout is exceeded.
> So when exceeding operation timeout it's possible sometimes to see a
> SocketTimeoutException from waitUntilDone, and sometimes see
> OperationTimeoutExceededException from the callables. It will depend on which
> one fails first. It may be nice to finish the swing here, ensuring that we
> always throw OperationTimeoutExceededException from the callables.
> The main remaining case is in the call to locateRegion, which hits meta and
> does not honor the call's operation timeout (it uses the meta operation
> timeout instead).
> Resolving this would require some refactoring of
> ConnectionImplementation.locateRegion to allow passing an operation timeout
> and having that affect the userRegionLock and meta scan.
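
To make the two failure paths in the description concrete, here is a minimal Java sketch. It is not HBase code: the class OperationTimeoutSketch, the OpTimeoutExceeded exception (a simplified stand-in for OperationTimeoutExceededException), and the simulatedWorkMs and enforceDeadlineInCallable parameters are all hypothetical. It only models the pattern described above, one callable per server plus a main-thread wait bounded by the operation timeout, and it does not model the RetriesExhaustedWithDetailsException wrapping or the meta cache.

```java
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative only: none of these types are the real HBase classes.
class OperationTimeoutSketch {

  /** Hypothetical stand-in for OperationTimeoutExceededException. */
  static class OpTimeoutExceeded extends RuntimeException {
    OpTimeoutExceeded(String server, int pendingActions) {
      super("operation timeout exceeded: server=" + server
          + ", pending actions=" + pendingActions);
    }
  }

  /** Submits one callable per server, then waits for all of them under one deadline. */
  static void runBatch(Map<String, List<String>> actionsByServer, long operationTimeoutMs,
      long simulatedWorkMs, boolean enforceDeadlineInCallable) throws Exception {
    long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(operationTimeoutMs);
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, actionsByServer.size()));
    try {
      List<Future<Void>> futures = new ArrayList<>();
      for (Map.Entry<String, List<String>> entry : actionsByServer.entrySet()) {
        String server = entry.getKey();
        int pending = entry.getValue().size();
        Callable<Void> task = () -> {
          long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadlineNanos - System.nanoTime());
          // Path 1: the callable notices it cannot finish in time and fails with context.
          if (enforceDeadlineInCallable && simulatedWorkMs > remainingMs) {
            throw new OpTimeoutExceeded(server, pending);
          }
          Thread.sleep(simulatedWorkMs); // simulated RPC to the server
          return null;
        };
        futures.add(pool.submit(task));
      }
      for (Future<Void> future : futures) {
        long remainingNanos = Math.max(0L, deadlineNanos - System.nanoTime());
        try {
          future.get(remainingNanos, TimeUnit.NANOSECONDS);
        } catch (TimeoutException e) {
          // Path 2: the outer wait gives up; no per-server detail is available here.
          throw new SocketTimeoutException("timed out after " + operationTimeoutMs + " ms");
        } catch (ExecutionException e) {
          if (e.getCause() instanceof RuntimeException) {
            throw (RuntimeException) e.getCause();
          }
          throw e;
        }
      }
    } finally {
      pool.shutdownNow(); // interrupt any callables that are still running
    }
  }

  /** Returns "done" on success, otherwise the simple name of the exception thrown. */
  static String outcome(Map<String, List<String>> actionsByServer, long operationTimeoutMs,
      long simulatedWorkMs, boolean enforceDeadlineInCallable) {
    try {
      runBatch(actionsByServer, operationTimeoutMs, simulatedWorkMs, enforceDeadlineInCallable);
      return "done";
    } catch (Exception e) {
      return e.getClass().getSimpleName();
    }
  }
}
```

In this sketch, the same slow batch surfaces as a SocketTimeoutException when only the outer wait enforces the deadline, and as the more descriptive stand-in exception when the callable enforces it too. A locateRegion that accepted the caller's deadline could take the second path in the same way, assuming a refactoring along the lines proposed above.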