[ 
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Roudnitsky updated HBASE-27781:
--------------------------------------
    Description: 
+Background+

In AsyncFutureRequestImpl we fail fast when operation timeout is exceeded 
during location resolution 
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462].
 In that handling, we loop over all actions still being processed in the 
groupAndSendMulti at the time of the operation timeout being exceeded and set 
them as failed. The problem is, some number of these actions may have already 
failed to completion when we get to this spot - if we fail to resolve region 
location for an action we will fail it to completion in 
[findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
 (fail to completion == set the error for the action, decrement actions in 
progress counter, and do not retry the action again) - and we should not 
"double fail" any actions that were already failed due to failed location 
resolution because we will decrement the actions in progress counter twice for 
the same action, and throw off the (atomic) action counter accounting the sync 
client relies on to [tell when the batch operation is 
complete|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1267-L1268].

+Problem+

In the for loop 
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
 we fail all actions (and decrement action in progress counter for all actions) 
in the groupAndSendMulti - which includes the aforementioned actions that were 
already failed through 
[findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
 - causing us to decrement the actions in progress counter more times than than 
there are actions. This causes an assertion error in the actions in progress 
counter since we go negative 
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1197]
 and should never have a negative amount of actions in progress, causing the 
HBase client to throw an unchecked exception that is not handled within the 
client which bubbles up to the user application layer invoking the client, 
which may kill the caller thread/application that invoked the operation that 
should have timed out with a RetriesExhaustedWithDetails exception (rather than 
throwing an unchecked AssertionError), as the user application layer should not 
be catching {{Error}} and its subclasses like {{{}AssertionError{}}}.

+Triggering scenario/reproduction+

The most common scenario where one could hit this bug is if there is meta 
slowness when running batch operations. Suppose we have a batch with 3 actions, 
and on trying to resolve the location for the first action, we timeout 
repeatedly to the meta table due to meta slowness and consume the entire 
operation timeout on the meta scan attempts to resolve the location of the 
first action. In this case, we will fail the first action through  
[findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
 which bring the actionsInProgress counter to 2, and then we will [loop over 
all three 
actions|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
 and fail each of them, on the third action failure attempt the actions in 
progress counter is zero and we attempt to decrement it to -1, and hit the 
assertion error. This is what the test case in the PR successfully reproduces. 

+Solution+
We still want to fail all remaining/incomplete actions being processed in 
groupAndSendMulti at the time of the operation timeout being exceeded, because 
there is no time remaining to execute them, but we need special handling to 
avoid failing actions which were already failed due to failed location 
resolution. 

  was:
+Background+

In AsyncFutureRequestImpl we fail fast when operation timeout is exceeded 
during location resolution 
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462].
 In that handling, we loop over all actions still being processed in the 
groupAndSendMulti at the time of the operation timeout being exceeded and set 
them as failed. The problem is, some number of these actions may have already 
failed to completion when we get to this spot - if we fail to resolve region 
location for an action we will fail it to completion in 
[findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
 (fail to completion == set the error for the action, decrement actions in 
progress counter, and do not retry the action again) - and we should not 
"double fail" any actions that were already failed due to failed location 
resolution because we will decrement the actions in progress counter twice for 
the same action, and throw off the (atomic) action counter accounting the sync 
client relies on to [tell that the batch operation is 
complete|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1267-L1268].

+Problem+

In the for loop 
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
 we fail all actions (and decrement action in progress counter for all actions) 
in the groupAndSendMulti - which includes the aforementioned actions that were 
already failed through 
[findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
 - causing us to decrement the actions in progress counter more times than than 
there are actions. This causes an assertion error in the actions in progress 
counter since we go negative 
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1197]
 and should never have a negative amount of actions in progress, causing the 
HBase client to throw an unchecked exception that is not handled within the 
client which bubbles up to the user application layer invoking the client, 
which may kill the caller thread/application that invoked the operation that 
should have timed out with a RetriesExhaustedWithDetails exception (rather than 
throwing an unchecked AssertionError), as the user application layer should not 
be catching {{Error}} and its subclasses like {{{}AssertionError{}}}.

+Triggering scenario/reproduction+

The most common scenario where one could hit this bug is if there is meta 
slowness when running batch operations. Suppose we have a batch with 3 actions, 
and on trying to resolve the location for the first action, we timeout 
repeatedly to the meta table due to meta slowness and consume the entire 
operation timeout on those meta scan attempts to resolve the location of the 
first action. In this case, we will fail the first action through  
[findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
 bring the actionsInProgress counter to 2, and then we will [loop over all 
three 
actions|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
 and fail each of them, on the third action failure attempt the actions in 
progress counter is zero and we attempt to decrement it to -1, and hit the 
assertion error. This is what the test case in the PR successfully reproduces. 

+Solution+
We still want to fail all remaining/incomplete actions being processed in 
groupAndSendMulti at the time of the operation timeout being exceeded, because 
there is no time remaining to execute them, but we need special handling to 
avoid failing actions which were already failed due to failed location 
resolution. 


> AssertionError in AsyncRequestFutureImpl when timing out during location 
> resolution
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-27781
>                 URL: https://issues.apache.org/jira/browse/HBASE-27781
>             Project: HBase
>          Issue Type: Bug
>          Components: asyncclient
>            Reporter: Bryan Beaudreault
>            Assignee: Daniel Roudnitsky
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.6.3
>
>
> +Background+
> In AsyncFutureRequestImpl we fail fast when operation timeout is exceeded 
> during location resolution 
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462].
>  In that handling, we loop over all actions still being processed in the 
> groupAndSendMulti at the time of the operation timeout being exceeded and set 
> them as failed. The problem is, some number of these actions may have already 
> failed to completion when we get to this spot - if we fail to resolve region 
> location for an action we will fail it to completion in 
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
>  (fail to completion == set the error for the action, decrement actions in 
> progress counter, and do not retry the action again) - and we should not 
> "double fail" any actions that were already failed due to failed location 
> resolution because we will decrement the actions in progress counter twice 
> for the same action, and throw off the (atomic) action counter accounting the 
> sync client relies on to [tell when the batch operation is 
> complete|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1267-L1268].
> +Problem+
> In the for loop 
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
>  we fail all actions (and decrement action in progress counter for all 
> actions) in the groupAndSendMulti - which includes the aforementioned actions 
> that were already failed through 
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
>  - causing us to decrement the actions in progress counter more times than 
> than there are actions. This causes an assertion error in the actions in 
> progress counter since we go negative 
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1197]
>  and should never have a negative amount of actions in progress, causing the 
> HBase client to throw an unchecked exception that is not handled within the 
> client which bubbles up to the user application layer invoking the client, 
> which may kill the caller thread/application that invoked the operation that 
> should have timed out with a RetriesExhaustedWithDetails exception (rather 
> than throwing an unchecked AssertionError), as the user application layer 
> should not be catching {{Error}} and its subclasses like 
> {{{}AssertionError{}}}.
> +Triggering scenario/reproduction+
> The most common scenario where one could hit this bug is if there is meta 
> slowness when running batch operations. Suppose we have a batch with 3 
> actions, and on trying to resolve the location for the first action, we 
> timeout repeatedly to the meta table due to meta slowness and consume the 
> entire operation timeout on the meta scan attempts to resolve the location of 
> the first action. In this case, we will fail the first action through  
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
>  which bring the actionsInProgress counter to 2, and then we will [loop over 
> all three 
> actions|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
>  and fail each of them, on the third action failure attempt the actions in 
> progress counter is zero and we attempt to decrement it to -1, and hit the 
> assertion error. This is what the test case in the PR successfully 
> reproduces. 
> +Solution+
> We still want to fail all remaining/incomplete actions being processed in 
> groupAndSendMulti at the time of the operation timeout being exceeded, 
> because there is no time remaining to execute them, but we need special 
> handling to avoid failing actions which were already failed due to failed 
> location resolution. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to