[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18004361#comment-18004361
]
Hudson commented on HBASE-27781:
--------------------------------
Results for branch branch-2.6
[build #336 on
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/]:
(x) *{color:red}-1 overall{color}*
----
details (if available):
(/) {color:green}+1 general checks{color}
-- For more information [see general
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/General_20Nightly_20Build_20Report/]
(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2)
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]
(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3)
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk17 hadoop3 checks{color}
-- For more information [see jdk17
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk17 hadoop 3.3.5 backward compatibility checks{color}
-- For more information [see jdk17
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(x) {color:red}-1 jdk17 hadoop 3.3.6 backward compatibility checks{color}
-- For more information [see jdk17
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk17 hadoop 3.4.0 backward compatibility checks{color}
-- For more information [see jdk17
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.6/336/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(/) {color:green}+1 client integration test for HBase 2 {color}
(/) {color:green}+1 client integration test for 3.3.5 {color}
(/) {color:green}+1 client integration test for 3.3.6 {color}
(/) {color:green}+1 client integration test for 3.4.0 {color}
(/) {color:green}+1 client integration test for 3.4.1 {color}
> AssertionError in AsyncRequestFutureImpl when timing out during location
> resolution
> -----------------------------------------------------------------------------------
>
> Key: HBASE-27781
> URL: https://issues.apache.org/jira/browse/HBASE-27781
> Project: HBase
> Issue Type: Bug
> Components: Client
> Affects Versions: 2.6.0, 2.5.3
> Reporter: Bryan Beaudreault
> Assignee: Daniel Roudnitsky
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.7.0, 2.5.12, 2.6.4
>
>
> +Background+
> In AsyncFutureRequestImpl we fail fast when operation timeout is exceeded
> during location resolution
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462].
> In that handling, we loop over all actions still being processed in the
> groupAndSendMulti at the time of the operation timeout being exceeded and set
> them as failed. The problem is, some number of these actions may have already
> failed to completion when we get to this spot - if we fail to resolve region
> location for an action we will fail it to completion in
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
> (fail to completion == set the error for the action, decrement actions in
> progress counter, and do not retry the action again) - and we should not
> "double fail" any actions that were already failed due to failed location
> resolution because we will decrement the actions in progress counter twice
> for the same action, and throw off the (atomic) action counter accounting the
> sync client relies on to [tell when the batch operation is
> complete|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1267-L1268].
> +Problem+
> In the for loop
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
> we fail all actions (and decrement action in progress counter for all
> actions) in the groupAndSendMulti - which includes the aforementioned actions
> that were already failed through
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
> - causing us to decrement the actions in progress counter more times than
> than there are actions if there was a location failure. This causes an
> assertion error in the actions in progress counter since we go negative
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1197]
> and should never have a negative amount of actions in progress, causing the
> HBase client to throw an unchecked exception that is not handled within the
> client which bubbles up to the user application layer invoking the client,
> which may kill the caller thread/application that invoked the operation that
> should have timed out with a RetriesExhaustedWithDetails exception (rather
> than throwing an unchecked AssertionError), as the user application layer
> should not be catching {{Error}} and its subclasses like
> {{{}AssertionError{}}}.
> +Triggering scenario/reproduction+
> The most common scenario where one could hit this bug is if there is meta
> slowness when running batch operations. Suppose we have a batch with 3
> actions, and on trying to resolve the location for the first action, we
> timeout repeatedly to the meta table due to meta slowness and consume the
> entire operation timeout on the meta scan attempts to resolve the location of
> the first action. In this case, we will fail the first action through
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
> which bring the actionsInProgress counter to 2, and then we will [loop over
> all three
> actions|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
> and fail each of them, on the third action failure attempt the actions in
> progress counter is zero and we attempt to decrement it to -1, and hit the
> assertion error. This is what the test case in the PR successfully
> reproduces.
> +Solution+
> We still want to fail all remaining/incomplete actions being processed in
> groupAndSendMulti at the time of the operation timeout being exceeded,
> because there is no time remaining to execute them, but we need special
> handling to avoid failing actions which were already failed due to failed
> location resolution.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)