lokiore opened a new pull request, #2491:
URL: https://github.com/apache/phoenix/pull/2491
<!--
Thanks for sending a pull request! Here are some tips for you:
1. If this is your first time, please read our contributor guidelines:
https://phoenix.apache.org/contributing.html
2. Ensure you have added or run the appropriate tests for your PR.
3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]
PHOENIX-XXXX Your PR title ...'.
4. Be sure to keep the PR description updated to reflect all changes.
5. Please write your PR title to summarize what this PR proposes, and
prefix it with the JIRA id, e.g. 'PHOENIX-XXXX Your PR title ...'.
6. If possible, provide a concise example to reproduce the issue for a
faster review.
-->
### What changes were proposed in this pull request?
<!--
Please clarify what changes you are proposing. The purpose of this section
is to outline the changes and how this PR fixes the issue.
-->
Two complementary changes that restore fail-fast behavior on the
batched-mutation path when a server-side HA failover signaling exception is
raised:
**1. Inheritance change on the two HA failover signaling exception classes**
— `MutationBlockedIOException` and `StaleClusterRoleRecordException` now extend
`org.apache.hadoop.hbase.DoNotRetryIOException` instead of
`java.io.IOException`. This is necessary so HBase's RPC retry layers
(`RpcRetryingCallerImpl`, the single-row path) correctly fail-fast on the first
hit instead of consuming the per-call retry budget.
**2. Phoenix-side intercept on the batched mutation path in
`MutationState.send`** — adds a cause-chain walker
`findHaFailoverCauseInChain(Throwable)` that detects either of the two HA
failover exceptions on the chain (handling both standard `getCause()` chains
and `RetriesExhaustedWithDetailsException` per-row inner causes). When the
helper finds a match and the new config flag
`phoenix.ha.failfast.failover.exceptions.enabled` is enabled (default `true`),
Phoenix's outer retry-decision logic at the catch site is skipped, so the
failure surfaces to Phoenix's HA-aware outer retry layer immediately rather
than being consumed by Phoenix-side retry-decision branches
(INDEX_METADATA_NOT_FOUND swallow, INDEX_WRITE_FAILURE retry).
The intercept is **narrowed to the two failover-specific exception classes**
(not all `DoNotRetryIOException` subtypes) so that bare `DoNotRetryIOException`
causes from unrelated code paths (test-infra index-write region observers,
etc.) flow through the existing retry-decision logic unchanged. A blanket
fail-fast on any DNRIOE-typed cause would break those flows.
The existing config flag-gate allows operators to opt out (set
`phoenix.ha.failfast.failover.exceptions.enabled=false`) and revert to pre-fix
behavior if the intercept is suspected of interacting badly with
deployment-specific retry semantics.
JIRA: https://issues.apache.org/jira/browse/PHOENIX-7871
### Why are the changes needed?
<!--
Please clarify why the changes are needed.
-->
The HBase 2.x batched-RPC retry path (`AsyncRequestFutureImpl.manageError`)
checks `instanceof DoNotRetryIOException` on the per-action result object
directly, *without* first running `RemoteException.unwrapRemoteException()`.
The per-action result therefore arrives as a wire-form
`RemoteWithExtrasException` — which extends `RemoteException` but **not**
`DoNotRetryIOException` — so the instanceof check fails and HBase retries
despite the server having explicitly thrown a DNRIOE-extending exception.
By the time the batched call completes, the retry budget has been consumed
and the failure is wrapped in `RetriesExhaustedWithDetailsException` before
reaching Phoenix's catch. For mutations issued during a brief mutation-block
window or against a now-STANDBY cluster, this manifests as multi-second tail
latencies that should have been single-RTT failures.
Phoenix's outer failover-aware retry (in the `FailoverPhoenixConnection` /
`HighAvailabilityGroup` layer) handles routing to the new ACTIVE on its own,
but only if the inner failure surfaces quickly. The client-side intercept in
`MutationState.send` re-establishes that quick surfacing for the batched path,
while preserving existing behavior for non-failover causes.
This helper becomes a no-op once HBase's batched-path retry layer is fixed
to consult the rehydrated DNRIOE bit (e.g. via
`RemoteWithExtrasException.isDoNotRetry()` or by calling
`unwrapRemoteException()` before the instanceof check). Until that ships,
Phoenix needs this client-side intercept.
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such as
new features, bug fixes, or other behavior changes.
-->
No
The only user-facing surface is the new opt-out config flag
`phoenix.ha.failfast.failover.exceptions.enabled` (default `true`). The
intercept is enabled by default; operators who wish to revert to pre-fix retry
behavior may set the flag to `false`. The exception-inheritance change is
internal — both exception classes still satisfy `instanceof IOException`, so
any existing `catch (IOException)` site continues to match.
### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some
test cases that check the changes thoroughly including negative and positive
cases if possible.
-->
**New unit tests (12 tests, all PASS):**
`HaFailoverExceptionInheritanceTest` (3 tests) — verifies the inheritance
change:
- `mutationBlockedExtendsDoNotRetryIOException`
- `staleClusterRoleRecordExtendsDoNotRetryIOException`
- `staleClusterRoleRecordTwoArgConstructorPreserved`
`MutationStateHaFailoverCauseTest` (9 tests) — verifies the cause-chain
walker:
- `returnsNullForNullInput` / `returnsNullWhenNoFailoverExceptionOnChain`
- `returnsNullForBareDoNotRetryWithoutFailoverSubtype` (regression guard
against over-broad fail-fast)
- `findsMutationBlockedAtTopLevel` / `findsStaleClusterRoleRecordAtTopLevel`
- `walksGetCauseChainForFailoverException`
- `findsFailoverCauseInRewdePerRowCauses` /
`findsFailoverCauseInRewdePerRowInnerCause`
- `selfReferenceSafe`
**Integration test (`IndexRegionObserverMutationBlockingIT`):** the existing
IT class is extended with new test methods (`testFailFastBatchedMutationBlock`,
`testInterceptDisabledViaConfigFlag`, etc.) that exercise the end-to-end
fail-fast behavior in a mini-cluster against a real region observer that throws
`MutationBlockedIOException`. The existing helper
`containsMutationBlockedException` is widened to handle both the post-fix
direct-chain layout (DNRIOE-typed inner) and the pre-fix REWDE-wrapped layout
(defense-in-depth).
Local commands run on `PHOENIX-7562-feature-new` HEAD:
```
mvn install -DskipTests #
full repo install
mvn -pl phoenix-core test
-Dtest='HaFailoverExceptionInheritanceTest,MutationStateHaFailoverCauseTest'
→ Tests run: 12, Failures: 0, Errors: 0, Skipped: 0
mvn -pl phoenix-core verify -DskipTests
-Dit.test=IndexRegionObserverMutationBlockingIT
→ IT run on mini-cluster (see test plan output)
```
### Was this patch authored or co-authored using generative AI tooling?
<!--
If generative AI tooling has been used in the process of authoring this
patch, please include the
phrase: 'Generated-by: ' followed by the name of the tool and its version.
Please refer to the [ASF Generative Tooling
Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
-->
Generated-by: Claude Code (Opus 4.7)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]