Lokesh Khurana created PHOENIX-7871:
---------------------------------------
Summary: StaleClusterRoleRecordException +
MutationBlockedIOException should extend DoNotRetryIOException; fail-fast
batched-mutation path
Key: PHOENIX-7871
URL: https://issues.apache.org/jira/browse/PHOENIX-7871
Project: Phoenix
Issue Type: Sub-task
Reporter: Lokesh Khurana
Assignee: Lokesh Khurana
When the HA mutation-block gate fires server-side during the ATS transition
window, batched mutations consume the full HBase retry budget
(hbase.client.retries.number, default 16) before the caller sees the
exception. Testing observed multi-second tails of failed mutations during what
should be a sub-second client-visible event. Two related issues drive this.
Issue 1: SCRE and MBE extend plain IOException
StaleClusterRoleRecordException and MutationBlockedIOException both extend
IOException. HBase's RPC retry layers (AsyncRequestFutureImpl.java,
RpcRetryingCallerImpl.java) check instanceof DoNotRetryIOException on the cause
to decide whether to retry. Today the check is false, so HBase retries the
failed batch many times before propagating the exception to the Phoenix client.
Each retry hits the same server-side gate and fails the same way. Both
exception classes are unmistakably non-retryable: SCRE means the client's CRR
cache is stale and a refresh is needed; MBE means the server is in the ATS
window and mutations are blocked until the role flip completes.
Fix: make both extend DoNotRetryIOException instead of plain IOException.
Both already preserve a (String) constructor compatible with HBase's
ProtobufUtil.toException reflection-based rehydration, so wire-compat is
preserved.
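A minimal sketch of the Issue 1 change, assuming nothing beyond what is
described above. To stay runnable without HBase on the classpath it uses a
local stand-in for org.apache.hadoop.hbase.DoNotRetryIOException; the
RetryGateSketch class and the shouldRetry and rehydrate helpers are
hypothetical and only imitate the instanceof check and the reflection-based
(String) constructor lookup, not HBase's actual code.

```java
import java.io.IOException;
import java.lang.reflect.Constructor;

class RetryGateSketch {
    // Stand-in for org.apache.hadoop.hbase.DoNotRetryIOException (assumption:
    // only its role as a marker type matters for this sketch).
    static class DoNotRetryIOException extends IOException {
        public DoNotRetryIOException(String message) { super(message); }
    }

    // Proposed shape: extend the do-not-retry marker and keep the (String) constructor.
    static class MutationBlockedIOException extends DoNotRetryIOException {
        public MutationBlockedIOException(String message) { super(message); }
    }

    static class StaleClusterRoleRecordException extends DoNotRetryIOException {
        public StaleClusterRoleRecordException(String message) { super(message); }
    }

    // Hypothetical, simplified retry decision: retry unless the marker type is present.
    static boolean shouldRetry(Throwable t) {
        return !(t instanceof DoNotRetryIOException);
    }

    // Hypothetical, simplified rehydration: find the class by name and call its
    // (String) constructor, which is why that constructor has to be kept.
    static IOException rehydrate(String className, String message) {
        try {
            Class<? extends IOException> cls =
                Class.forName(className).asSubclass(IOException.class);
            Constructor<? extends IOException> ctor = cls.getDeclaredConstructor(String.class);
            ctor.setAccessible(true);
            return ctor.newInstance(message);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot rehydrate " + className, e);
        }
    }

    public static void main(String[] args) {
        IOException plain = new IOException("transient");
        IOException blocked =
            rehydrate(MutationBlockedIOException.class.getName(), "mutations blocked");
        System.out.println("retry plain IOException: " + shouldRetry(plain));
        System.out.println("retry rehydrated MBE:    " + shouldRetry(blocked));
    }
}
```

With the old inheritance (extends IOException) the same shouldRetry check would
return true for both exceptions, which matches the retry storm described above.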
Issue 2: Batched-path retry layer doesn't unwrap RemoteWithExtrasException
before the instanceof check
Issue 1's fix is necessary but not sufficient. HBase 2's
AsyncRequestFutureImpl.manageError is called from two paths:
- receiveGlobalFailure: receives the exception post-translateException,
after unwrapRemoteException() has already restored the rehydrated
DoNotRetryIOException-typed (DNRIOE) instance. Fail-fast works here.
- receiveMultiAction: receives the raw per-action result directly. The
result arrives as the wire-form RemoteWithExtrasException, NOT as a rehydrated
DNRIOE-typed instance. The instanceof DoNotRetryIOException check returns
false, so HBase retries.
Phoenix's UPSERT batch path goes through MutationState.send → hTable.batch →
AsyncRequestFutureImpl.batch → receiveMultiAction:967, which is the broken
path. The single-row path (used for non-batched ops) goes through
RpcRetryingCallerImpl and is fixed by Issue 1's inheritance change alone.
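The difference between the two paths can be illustrated with a small
simulation. This is a sketch under stated assumptions, not HBase code:
WireFormException is a hypothetical stand-in for the wire-form
RemoteWithExtrasException, DoNotRetryIOException is a local stand-in for the
HBase class of the same name, and failFastNaive / failFastUnwrapped are
invented names for the check without and with an unwrap step.

```java
import java.io.IOException;

class UnwrapSketch {
    // Local stand-in for org.apache.hadoop.hbase.DoNotRetryIOException.
    static class DoNotRetryIOException extends IOException {
        public DoNotRetryIOException(String message) { super(message); }
    }

    static class MutationBlockedIOException extends DoNotRetryIOException {
        public MutationBlockedIOException(String message) { super(message); }
    }

    // Hypothetical stand-in for the wire-form exception: it is a plain
    // IOException that only carries the remote class name as a string.
    static class WireFormException extends IOException {
        private final String className;

        WireFormException(String className, String message) {
            super(message);
            this.className = className;
        }

        // Rebuild the typed exception via its (String) constructor; fall back
        // to the wrapper itself if the class cannot be instantiated.
        IOException unwrap() {
            try {
                Class<?> cls = Class.forName(className);
                if (IOException.class.isAssignableFrom(cls)) {
                    return (IOException) cls.getDeclaredConstructor(String.class)
                                            .newInstance(getMessage());
                }
            } catch (ReflectiveOperationException e) {
                // Not rehydratable: keep the wire-form exception.
            }
            return this;
        }
    }

    // Mirrors the batched path described above: checks the raw result as received.
    static boolean failFastNaive(Throwable t) {
        return t instanceof DoNotRetryIOException;
    }

    // Mirrors the global-failure path described above: unwraps before checking.
    static boolean failFastUnwrapped(Throwable t) {
        Throwable u = (t instanceof WireFormException) ? ((WireFormException) t).unwrap() : t;
        return u instanceof DoNotRetryIOException;
    }

    public static void main(String[] args) {
        Throwable raw = new WireFormException(
            MutationBlockedIOException.class.getName(), "mutations blocked");
        System.out.println("naive check fails fast:     " + failFastNaive(raw));
        System.out.println("unwrapped check fails fast: " + failFastUnwrapped(raw));
    }
}
```

In this simulation the naive check never sees the marker type, so even after
the Issue 1 inheritance change the batched path would keep retrying unless the
wire-form exception is unwrapped first.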