[jira] [Resolved] (PHOENIX-7871) StaleClusterRoleRecordException + MutationBlockedIOException should extend DoNotRetryIOException; fail-fast batched-mutation path

Lokesh Khurana (Jira) Tue, 16 Jun 2026 10:57:06 -0700


     [ 
https://issues.apache.org/jira/browse/PHOENIX-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Lokesh Khurana resolved PHOENIX-7871.
-------------------------------------
    Resolution: Resolved

> StaleClusterRoleRecordException + MutationBlockedIOException should extend 
> DoNotRetryIOException; fail-fast batched-mutation path
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-7871
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7871
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: Lokesh Khurana
>            Assignee: Lokesh Khurana
>            Priority: Major
>
> When the HA mutation-block gate fires server-side during the ATS transition 
> window, batched mutations consume the full HBase retry budget 
> (hbase.client.retries.number, default 16) before the caller sees the 
> exception. Testing observed multi-second tails of failed mutations during 
> what should be a sub-second client-visible event. Two related issues drive 
> this.
>   Issue 1: SCRE and MBE extend plain IOException
>   StaleClusterRoleRecordException and MutationBlockedIOException both extend 
> IOException. HBase's RPC retry layers (AsyncRequestFutureImpl.java, 
> RpcRetryingCallerImpl.java) check instanceof DoNotRetryIOException on the 
> cause to decide whether to retry. Today the check is false, so HBase retries 
> the failed batch many times before propagating the exception to the Phoenix 
> client. Each retry hits the same server-side gate and fails the same way. 
> Both exception classes are unmistakably non-retryable: SCRE means the 
> client's CRR cache is stale and a refresh is needed; MBE means the server is 
> in the ATS window and mutations are blocked until the role flip completes.
>   Fix: make both extend DoNotRetryIOException instead of plain IOException. 
> Both already preserve a (String) constructor compatible with HBase's 
> ProtobufUtil.toException reflection-based rehydration, so wire-compat is 
> preserved.
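
A minimal sketch of why the superclass matters to a retry loop, assuming nothing beyond what the description states. Every class here is a hypothetical stand-in (including the local DoNotRetryIOException, which only mimics the HBase class of the same name), and the loop is a simplification, not HBase's actual retry code. The budget of 16 attempts is borrowed from the figure quoted above purely for illustration.

```java
import java.io.IOException;

// Hypothetical stand-in for org.apache.hadoop.hbase.DoNotRetryIOException.
class DoNotRetryIOException extends IOException {
    DoNotRetryIOException(String msg) { super(msg); }
}

// Before the change: a plain IOException, which the loop treats as retryable.
class MutationBlockedBefore extends IOException {
    MutationBlockedBefore(String msg) { super(msg); }
}

// After the change: same (String) constructor, but typed as do-not-retry.
class MutationBlockedAfter extends DoNotRetryIOException {
    MutationBlockedAfter(String msg) { super(msg); }
}

class RetrySketch {
    interface Call { void run() throws IOException; }

    // Runs the call until it succeeds, hits a do-not-retry exception,
    // or exhausts maxAttempts. Returns the number of attempts made.
    static int attempts(Call call, int maxAttempts) {
        int tries = 0;
        while (tries < maxAttempts) {
            tries++;
            try {
                call.run();
                return tries;               // success
            } catch (IOException e) {
                if (e instanceof DoNotRetryIOException) {
                    return tries;           // fail fast, no further retries
                }
                // otherwise fall through and retry
            }
        }
        return tries;                       // budget exhausted
    }

    public static void main(String[] args) {
        // Plain IOException: every attempt hits the same gate. Prints 16.
        System.out.println(attempts(() -> { throw new MutationBlockedBefore("blocked"); }, 16));
        // DoNotRetryIOException subtype: stops after the first failure. Prints 1.
        System.out.println(attempts(() -> { throw new MutationBlockedAfter("blocked"); }, 16));
    }
}
```
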
>   Issue 2: Batched-path retry layer doesn't unwrap RemoteWithExtrasException 
> before the instanceof check
>   Issue 1's fix is necessary but not sufficient. HBase 2's 
> AsyncRequestFutureImpl.manageError is called from two paths:
>   - receiveGlobalFailure: receives the exception post-translateException, 
> after unwrapRemoteException() has already restored the rehydrated 
> DoNotRetryIOException-typed instance. Fail-fast works here.
>   - receiveMultiAction: receives the raw per-action result directly. The 
> result arrives as the wire-form RemoteWithExtrasException, not as a 
> rehydrated DoNotRetryIOException-typed instance, so the instanceof 
> DoNotRetryIOException check returns false and HBase retries.
>   Phoenix's UPSERT batch path goes through MutationState.send → hTable.batch 
> → AsyncRequestFutureImpl.batch → receiveMultiAction:967, which is the 
> broken path. The single-row path (used for non-batched ops) goes through 
> RpcRetryingCallerImpl and is fixed by Issue 1's inheritance change alone.
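
The Issue 2 failure mode can be illustrated with a small sketch. It is not the patch for this issue and not HBase code: WireException is a hypothetical stand-in for a wire-form wrapper such as RemoteWithExtrasException, FailFastIOException stands in for DoNotRetryIOException, and the reflection-based unwrap() only imitates the (String)-constructor rehydration the description mentions.

```java
import java.io.IOException;

// Hypothetical stand-in for DoNotRetryIOException.
class FailFastIOException extends IOException {
    public FailFastIOException(String msg) { super(msg); }
}

// Hypothetical stand-in for a Phoenix exception after Issue 1's change.
class BlockedMutationException extends FailFastIOException {
    public BlockedMutationException(String msg) { super(msg); }
}

// Hypothetical wire-form wrapper: it carries only the remote class name
// and message, so its own Java type says nothing about retryability.
class WireException extends IOException {
    private final String className;

    WireException(String className, String msg) {
        super(msg);
        this.className = className;
    }

    // Rehydrates the original exception through its (String) constructor.
    // Falls back to returning this wrapper if that is not possible.
    IOException unwrap() {
        try {
            Class<?> cls = Class.forName(className);
            if (!IOException.class.isAssignableFrom(cls)) {
                return this;
            }
            return (IOException) cls.getDeclaredConstructor(String.class)
                                    .newInstance(getMessage());
        } catch (ReflectiveOperationException e) {
            return this;
        }
    }
}

class UnwrapSketch {
    // Mirrors the broken path: the check runs on the wrapper itself.
    static boolean retryWithoutUnwrap(Throwable t) {
        return !(t instanceof FailFastIOException);
    }

    // Mirrors the working path: unwrap first, then check the type.
    static boolean retryWithUnwrap(Throwable t) {
        Throwable cause = (t instanceof WireException) ? ((WireException) t).unwrap() : t;
        return !(cause instanceof FailFastIOException);
    }

    public static void main(String[] args) {
        WireException wire =
            new WireException(BlockedMutationException.class.getName(), "mutations blocked");
        System.out.println(retryWithoutUnwrap(wire)); // true: wrapper looks retryable
        System.out.println(retryWithUnwrap(wire));    // false: rehydrated type fails fast
    }
}
```

The point of the sketch is that the inheritance change alone is invisible to any check that inspects the wrapper rather than the rehydrated exception.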



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
