[
https://issues.apache.org/jira/browse/PHOENIX-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lokesh Khurana resolved PHOENIX-7871.
-------------------------------------
Resolution: Resolved
> StaleClusterRoleRecordException + MutationBlockedIOException should extend
> DoNotRetryIOException; fail-fast batched-mutation path
> ---------------------------------------------------------------------------------------------------------------------------------
>
> Key: PHOENIX-7871
> URL: https://issues.apache.org/jira/browse/PHOENIX-7871
> Project: Phoenix
> Issue Type: Sub-task
> Reporter: Lokesh Khurana
> Assignee: Lokesh Khurana
> Priority: Major
>
> When the HA mutation-block gate fires server-side during the ATS transition
> window, batched mutations consume the full HBase retry budget
> (hbase.client.retries.number, default 16) before the caller sees the
> exception. Testing observed multi-second tails of failed mutations during
> what should be a sub-second client-visible event. Two related issues drive
> this.
>
> Issue 1: SCRE and MBE extend plain IOException
> StaleClusterRoleRecordException and MutationBlockedIOException both extend
> IOException. HBase's RPC retry layers (AsyncRequestFutureImpl.java,
> RpcRetryingCallerImpl.java) check instanceof DoNotRetryIOException on the
> cause to decide whether to retry. Today the check is false, so HBase retries
> the failed batch many times before propagating the exception to the Phoenix
> client. Each retry hits the same server-side gate and fails the same way.
> Both exception classes are unmistakably non-retryable: SCRE means the
> client's CRR cache is stale and a refresh is needed; MBE means the server is
> in the ATS window and mutations are blocked until the role flip completes.
> Fix: make both extend DoNotRetryIOException instead of plain IOException.
> Both already have a (String) constructor compatible with HBase's
> ProtobufUtil.toException reflection-based rehydration, so wire compatibility
> is preserved.
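
A minimal sketch of why the inheritance change matters, assuming a simplified retry loop. The nested DoNotRetryIOException here is a local stand-in for org.apache.hadoop.hbase.DoNotRetryIOException so the example compiles without HBase on the classpath, and attemptsUsed is a hypothetical helper, not HBase's actual retry code. It only illustrates the instanceof decision described above.

```java
import java.io.IOException;

public class FailFastSketch {
    // Local stand-in for org.apache.hadoop.hbase.DoNotRetryIOException (assumption).
    public static class DoNotRetryIOException extends IOException {
        public DoNotRetryIOException(String msg) { super(msg); }
    }

    // After the proposed fix: both exceptions extend the do-not-retry base
    // and keep a (String) constructor.
    public static class StaleClusterRoleRecordException extends DoNotRetryIOException {
        public StaleClusterRoleRecordException(String msg) { super(msg); }
    }

    public static class MutationBlockedIOException extends DoNotRetryIOException {
        public MutationBlockedIOException(String msg) { super(msg); }
    }

    public interface Call {
        void run() throws IOException;
    }

    // Hypothetical retry loop: returns how many attempts were made before the
    // call succeeded, failed fast, or exhausted the budget.
    public static int attemptsUsed(Call call, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            try {
                call.run();
                return attempts;          // success
            } catch (DoNotRetryIOException e) {
                return attempts;          // fail fast: do not retry
            } catch (IOException e) {
                // treated as retryable: loop again
            }
        }
        return attempts;                  // retry budget exhausted
    }

    public static void main(String[] args) {
        int blocked = attemptsUsed(
            () -> { throw new MutationBlockedIOException("mutations blocked"); }, 16);
        int plain = attemptsUsed(
            () -> { throw new IOException("plain IOException"); }, 16);
        System.out.println("do-not-retry attempts: " + blocked);
        System.out.println("plain IOException attempts: " + plain);
    }
}
```

With a budget of 16 attempts, the do-not-retry exception ends the loop after the first attempt, while a plain IOException uses all 16.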
>
> Issue 2: Batched-path retry layer doesn't unwrap RemoteWithExtrasException
> before the instanceof check
> Issue 1's fix is necessary but not sufficient. HBase 2's
> AsyncRequestFutureImpl.manageError is called from two paths:
> - receiveGlobalFailure — receives the exception post-translateException,
> after unwrapRemoteException() has already restored the rehydrated
> DNRIOE-typed instance. Fail-fast works here.
> - receiveMultiAction — receives the raw per-action result directly. The
> result arrives as the wire-form RemoteWithExtrasException, NOT as a
> rehydrated DNRIOE-typed instance. The instanceof DoNotRetryIOException check
> returns false. HBase retries.
> Phoenix's UPSERT batch path goes through MutationState.send → hTable.batch
> → AsyncRequestFutureImpl.batch → receiveMultiAction:967 → broken path. The
> single-row path (used for non-batched ops) goes through
> RpcRetryingCallerImpl → fixed by Issue 1's inheritance change alone.
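
The difference between the two paths in Issue 2 can be illustrated with a small sketch. Everything here is an assumption made for illustration: WireFormException is a hypothetical stand-in for the wire-form RemoteWithExtrasException (it carries only a class name and a message), its unwrap() imitates reflection-based rehydration through the (String) constructor, and DoNotRetryIOException is again a local stand-in for the HBase class. It does not reproduce HBase's code and is not necessarily how the issue was resolved.

```java
import java.io.IOException;

public class UnwrapSketch {
    // Local stand-in for org.apache.hadoop.hbase.DoNotRetryIOException (assumption).
    public static class DoNotRetryIOException extends IOException {
        public DoNotRetryIOException(String msg) { super(msg); }
    }

    public static class MutationBlockedIOException extends DoNotRetryIOException {
        public MutationBlockedIOException(String msg) { super(msg); }
    }

    // Hypothetical wire-form exception: knows the remote class name but is
    // not itself an instance of that class.
    public static class WireFormException extends IOException {
        private final String remoteClassName;

        public WireFormException(String remoteClassName, String msg) {
            super(msg);
            this.remoteClassName = remoteClassName;
        }

        // Rehydrate the original type through its (String) constructor;
        // fall back to this wrapper if the class cannot be instantiated.
        public IOException unwrap() {
            try {
                Class<? extends IOException> cls =
                    Class.forName(remoteClassName).asSubclass(IOException.class);
                return cls.getConstructor(String.class).newInstance(getMessage());
            } catch (ReflectiveOperationException | ClassCastException e) {
                return this;
            }
        }
    }

    // Check applied directly to the raw result: misses the wire form.
    public static boolean naiveIsDoNotRetry(Throwable t) {
        return t instanceof DoNotRetryIOException;
    }

    // Check applied after unwrapping: sees the rehydrated type.
    public static boolean unwrappingIsDoNotRetry(Throwable t) {
        Throwable effective =
            (t instanceof WireFormException) ? ((WireFormException) t).unwrap() : t;
        return effective instanceof DoNotRetryIOException;
    }

    public static void main(String[] args) {
        WireFormException wire = new WireFormException(
            MutationBlockedIOException.class.getName(), "mutations blocked");
        System.out.println("naive check: " + naiveIsDoNotRetry(wire));
        System.out.println("unwrapping check: " + unwrappingIsDoNotRetry(wire));
    }
}
```

In this sketch the naive check returns false for the wire-form exception, which corresponds to the retrying behaviour described for receiveMultiAction, while the unwrapping check returns true, which corresponds to the receiveGlobalFailure path.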
--
This message was sent by Atlassian Jira
(v8.20.10#820010)