Bryan Beaudreault created HBASE-28358:
-----------------------------------------
Summary: AsyncProcess inconsistent exception thrown for operation timeout
Key: HBASE-28358
URL: https://issues.apache.org/jira/browse/HBASE-28358
Project: HBase
Issue Type: Bug
Reporter: Bryan Beaudreault

I'm not sure if I'll get to this, but wanted to log it as a known issue.

AsyncProcess has a design where it breaks the batch into sub-batches based on
regionserver, then submits a callable per regionserver in a threadpool. In the
main thread, it calls waitUntilDone() with an operation timeout. If the
callables don't finish within the operation timeout, a SocketTimeoutException
is thrown. This exception is not very useful because it doesn't give you any
sense of how many calls were in progress, on which servers, or why they were
delayed.
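
To make the shape of the problem concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the AsyncProcess code: `BatchTimeoutSketch`, `Action`, and `submitAndWait` are hypothetical names, and the RPC is simulated with a sleep. It only illustrates how a wait in the main thread ends up throwing an exception that carries no per-server detail.

```java
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BatchTimeoutSketch {
    /** Hypothetical stand-in for one action routed to one regionserver. */
    public static final class Action {
        final String server;
        final long workMillis;

        public Action(String server, long workMillis) {
            this.server = server;
            this.workMillis = workMillis;
        }
    }

    /** Groups actions by server, runs one task per server, waits up to timeoutMillis. */
    public static void submitAndWait(List<Action> batch, long timeoutMillis)
            throws SocketTimeoutException, InterruptedException {
        // Break the batch into sub-batches keyed by server.
        Map<String, List<Action>> byServer = new TreeMap<>();
        for (Action a : batch) {
            byServer.computeIfAbsent(a.server, k -> new ArrayList<>()).add(a);
        }
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, byServer.size()));
        CountDownLatch done = new CountDownLatch(byServer.size());
        try {
            for (List<Action> subBatch : byServer.values()) {
                pool.submit(() -> {
                    try {
                        for (Action a : subBatch) {
                            Thread.sleep(a.workMillis); // simulated RPC
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        done.countDown();
                    }
                });
            }
            // The main thread only knows that "not everything finished in time".
            if (!done.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new SocketTimeoutException(
                    "operation timed out after " + timeoutMillis + " ms");
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note that the exception thrown from the wait has no way of knowing which sub-batch was slow unless that information is tracked separately.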

Recently we've been improving the adherence to operation timeout within the
callables themselves. The main driver here has been to ensure we don't
erroneously clear the meta cache for operation timeout related errors. So we've
added a new OperationTimeoutExceededException, which is thrown from within the
callables and does not cause a meta cache clear. The added benefit is that if
these bubble up to the caller, they are wrapped in
RetriesExhaustedWithDetailsException which includes a lot more info about which
server and which action is affected.
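
As a rough illustration of why a timeout raised inside the callable is more informative, the sketch below checks the remaining budget before each action and records which server and which action were affected. All names here (`CallableDeadlineSketch`, `ActionTimeout`, `runSubBatch`) are hypothetical, time is simulated with a counter rather than real RPCs, and nothing here mirrors the actual OperationTimeoutExceededException or RetriesExhaustedWithDetailsException classes.

```java
import java.util.ArrayList;
import java.util.List;

public class CallableDeadlineSketch {
    /** Hypothetical record of one action that could not run within the budget. */
    public static final class ActionTimeout {
        public final String server;
        public final int actionIndex;
        public final String message;

        ActionTimeout(String server, int actionIndex, long budgetMillis) {
            this.server = server;
            this.actionIndex = actionIndex;
            this.message = "operation timeout of " + budgetMillis
                + " ms exceeded on " + server + " before action " + actionIndex;
        }
    }

    /**
     * Runs a simulated sub-batch for one server. costMillis[i] is the simulated
     * duration of action i. Returns one entry per action that was skipped
     * because the budget was already spent.
     */
    public static List<ActionTimeout> runSubBatch(
            String server, long[] costMillis, long budgetMillis) {
        List<ActionTimeout> failures = new ArrayList<>();
        long elapsed = 0;
        for (int i = 0; i < costMillis.length; i++) {
            if (elapsed >= budgetMillis) {
                // Budget exhausted: fail this action with server and index attached.
                failures.add(new ActionTimeout(server, i, budgetMillis));
                continue;
            }
            elapsed += costMillis[i]; // simulated RPC duration
        }
        return failures;
    }
}
```

Because the check happens where the server and action are known, the failure can carry both, which the outer wait cannot do.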

We've now covered most, but not all, cases where the operation timeout is
exceeded. So when the operation timeout is exceeded, you will sometimes see a
SocketTimeoutException from waitUntilDone and sometimes an
OperationTimeoutExceededException from the callables, depending on which one
fails first. It would be nice to finish the job here and ensure that we always
throw OperationTimeoutExceededException from the callables.

The main remaining case is the call to locateRegion, which hits meta and
does not honor the call's operation timeout (it uses the meta operation timeout
instead).
Resolving this would require some refactoring of
ConnectionImplementation.locateRegion to allow passing an operation timeout and
having that affect the userRegionLock and meta scan.
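
One possible shape for that refactoring, sketched under assumptions: a single deadline is computed from the operation timeout and then applied first to the lock acquisition and then to the meta scan. `LocateWithDeadlineSketch`, `locate`, and the `scan` callback are hypothetical; the lock is only a stand-in for userRegionLock, and this is not the ConnectionImplementation code.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.LongFunction;

public class LocateWithDeadlineSketch {
    // Stand-in for userRegionLock.
    private final ReentrantLock regionLock = new ReentrantLock();

    /**
     * Acquires the lock and runs the lookup, both bounded by one shared budget.
     * The scan callback (hypothetical) receives the remaining nanoseconds so it
     * can bound its own work.
     */
    public String locate(long operationTimeoutNanos, LongFunction<String> scan)
            throws TimeoutException, InterruptedException {
        long deadline = System.nanoTime() + operationTimeoutNanos;
        // Waiting for the lock consumes part of the operation budget.
        if (!regionLock.tryLock(operationTimeoutNanos, TimeUnit.NANOSECONDS)) {
            throw new TimeoutException("timed out waiting for region lock");
        }
        try {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                throw new TimeoutException("no time left for meta scan");
            }
            // Whatever is left of the budget bounds the scan.
            return scan.apply(remaining);
        } finally {
            regionLock.unlock();
        }
    }
}
```

The design choice here is to track one absolute deadline rather than pass the full timeout to each step, so time spent waiting on the lock is not granted again to the scan.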