Lokesh Khurana created PHOENIX-7871:
---------------------------------------

             Summary: StaleClusterRoleRecordException + 
MutationBlockedIOException should extend DoNotRetryIOException; fail-fast 
batched-mutation path
                 Key: PHOENIX-7871
                 URL: https://issues.apache.org/jira/browse/PHOENIX-7871
             Project: Phoenix
          Issue Type: Sub-task
            Reporter: Lokesh Khurana
            Assignee: Lokesh Khurana


When the HA mutation-block gate fires server-side during the ATS transition 
window, batched mutations consume the full HBase retry budget 
(hbase.client.retries.number, default 16) before the caller sees the 
exception. Testing observed multi-second tails of failed mutations during 
what should be a sub-second client-visible event. Two related issues drive 
this.

  Issue 1: SCRE and MBE extend plain IOException

  StaleClusterRoleRecordException and MutationBlockedIOException both extend 
IOException. HBase's RPC retry layers (AsyncRequestFutureImpl.java, 
RpcRetryingCallerImpl.java) check instanceof DoNotRetryIOException on the cause 
to decide whether to retry. Today the check is false, so HBase retries the 
failed batch many times before propagating the exception to the Phoenix client. 
Each retry hits the same server-side gate and fails the same way. Both 
exception classes are unmistakably non-retryable: SCRE means the client's CRR 
cache is stale and a refresh is needed; MBE means the server is in the ATS 
window and mutations are blocked until the role flip completes.

  Fix: make both extend DoNotRetryIOException instead of plain IOException. 
Both already preserve a (String) constructor compatible with HBase's 
ProtobufUtil.toException reflection-based rehydration, so wire-compat is 
preserved.
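
  The inheritance change and the (String)-constructor requirement can be 
sketched as follows. This is a minimal illustration, not the Phoenix source: 
the nested DoNotRetryIOException is a local stand-in for 
org.apache.hadoop.hbase.DoNotRetryIOException so the example compiles without 
HBase on the classpath, and isRetryable and rehydrate are hypothetical helpers 
that only approximate the instanceof check and the reflection-based 
rehydration described above.

```java
import java.io.IOException;
import java.lang.reflect.Constructor;

final class NonRetryableSketch {
    // Stand-in for org.apache.hadoop.hbase.DoNotRetryIOException (assumption:
    // defined locally only so this sketch is self-contained).
    static class DoNotRetryIOException extends IOException {
        DoNotRetryIOException(String message) { super(message); }
    }

    // Simplified shape of the two exceptions after the proposed change:
    // they extend the do-not-retry type and keep a (String) constructor.
    static class StaleClusterRoleRecordException extends DoNotRetryIOException {
        public StaleClusterRoleRecordException(String message) { super(message); }
    }

    static class MutationBlockedIOException extends DoNotRetryIOException {
        public MutationBlockedIOException(String message) { super(message); }
    }

    // Hypothetical helper mirroring the retry layer's decision: anything that
    // is not a DoNotRetryIOException is treated as retryable.
    static boolean isRetryable(Throwable t) {
        return !(t instanceof DoNotRetryIOException);
    }

    // Hypothetical helper approximating reflection-based rehydration: rebuild
    // the exception from its class name and message via the (String) constructor.
    static IOException rehydrate(String className, String message)
            throws ReflectiveOperationException {
        Class<? extends IOException> cls =
                Class.forName(className).asSubclass(IOException.class);
        Constructor<? extends IOException> ctor = cls.getDeclaredConstructor(String.class);
        ctor.setAccessible(true);
        return ctor.newInstance(message);
    }

    public static void main(String[] args) throws Exception {
        IOException e = rehydrate(MutationBlockedIOException.class.getName(),
                "mutations blocked during role transition");
        // prints "MutationBlockedIOException retryable=false"
        System.out.println(e.getClass().getSimpleName() + " retryable=" + isRetryable(e));
    }
}
```

  If the (String) constructor were removed, rehydrate would fail with 
NoSuchMethodException in this sketch, which is why keeping that constructor 
matters for the change.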

  Issue 2: Batched-path retry layer doesn't unwrap RemoteWithExtrasException 
before the instanceof check

  Issue 1's fix is necessary but not sufficient. HBase 2's 
AsyncRequestFutureImpl.manageError is called from two paths:
  - receiveGlobalFailure: receives the exception post-translateException, 
after unwrapRemoteException() has already restored the rehydrated DNRIOE-typed 
instance. Fail-fast works here.
  - receiveMultiAction: receives the raw per-action result directly. The 
result arrives as the wire-form RemoteWithExtrasException, NOT as a rehydrated 
DNRIOE-typed instance. The instanceof DoNotRetryIOException check returns 
false, so HBase retries.

  Phoenix's UPSERT batch path goes through MutationState.send → hTable.batch → 
AsyncRequestFutureImpl.batch → receiveMultiAction:967 → broken path. The 
single-row path (used for non-batched ops) goes through RpcRetryingCallerImpl 
→ fixed by Issue 1's inheritance change alone.
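
  The difference between the two paths can be illustrated with a small 
sketch. It is not HBase code and not necessarily the fix this ticket adopts: 
RemoteWireException is a hypothetical stand-in for the wire-form 
RemoteWithExtrasException, DoNotRetryIOException is again a local stand-in, 
and shouldRetryRaw / shouldRetryUnwrapped are made-up names for the check 
without and with an unwrap step.

```java
import java.io.IOException;
import java.lang.reflect.Constructor;

final class UnwrapBeforeCheckSketch {
    // Local stand-in for org.apache.hadoop.hbase.DoNotRetryIOException.
    static class DoNotRetryIOException extends IOException {
        DoNotRetryIOException(String message) { super(message); }
    }

    static class MutationBlockedIOException extends DoNotRetryIOException {
        public MutationBlockedIOException(String message) { super(message); }
    }

    // Hypothetical stand-in for the wire-form exception: it carries the remote
    // class name as a string and is not itself a DoNotRetryIOException.
    static class RemoteWireException extends IOException {
        private final String className;

        RemoteWireException(String className, String message) {
            super(message);
            this.className = className;
        }

        // Rebuilds the typed exception through its (String) constructor;
        // falls back to this wrapper when the class cannot be instantiated.
        IOException unwrap() {
            try {
                Class<? extends IOException> cls =
                        Class.forName(className).asSubclass(IOException.class);
                Constructor<? extends IOException> ctor =
                        cls.getDeclaredConstructor(String.class);
                ctor.setAccessible(true);
                return ctor.newInstance(getMessage());
            } catch (ReflectiveOperationException | ClassCastException e) {
                return this;
            }
        }
    }

    // Check applied directly to the raw per-action result.
    static boolean shouldRetryRaw(Throwable t) {
        return !(t instanceof DoNotRetryIOException);
    }

    // Same check, applied after unwrapping the wire-form exception.
    static boolean shouldRetryUnwrapped(Throwable t) {
        Throwable cause = (t instanceof RemoteWireException)
                ? ((RemoteWireException) t).unwrap()
                : t;
        return !(cause instanceof DoNotRetryIOException);
    }

    public static void main(String[] args) {
        Throwable wire = new RemoteWireException(
                MutationBlockedIOException.class.getName(), "mutations blocked");
        System.out.println("raw check retries: " + shouldRetryRaw(wire));             // true
        System.out.println("unwrapped check retries: " + shouldRetryUnwrapped(wire)); // false
    }
}
```

  In this sketch the raw check retries even though the remote class is a 
do-not-retry type, which mirrors the receiveMultiAction behaviour described 
above; the unwrapped check only fails fast once Issue 1's inheritance change 
is also in place.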



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
