[
https://issues.apache.org/jira/browse/HBASE-28589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ZhenyuLi updated HBASE-28589:
-----------------------------
Description:
*Title: HBASE-14598 fix incomplete - DoNotRetryException not propagated to
client, causing cascading RegionServer failures*
*Description:*
I have discovered that the fix for HBASE-14598 does not completely resolve the
issue, and the problem persists in the latest branches (3.0 and 2.6).
*Background:* The original fix for HBASE-14598 addressed two aspects:
# When a Scan/Get RPC attempts to allocate an excessively large array that
could trigger an OutOfMemoryError (OOM), it checks the array size before
allocation and throws a {{BufferOverflowException}} to prevent RegionServer
crashes and potential cascading failures.
# The fix intended to stop client retries for such failures by throwing a
{{DoNotRetryException}} when a {{BufferOverflowException}} occurs, as retrying
cannot resolve the underlying issue.
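The two aspects above can be illustrated with a minimal sketch. It is not the HBase source: {{GrowableBufferSketch}}, {{DoNotRetrySketchException}}, and the configurable {{maxSize}} cap are hypothetical names introduced only for illustration, and the real {{ByteBufferOutputStream}} may differ in detail.

```java
import java.io.IOException;
import java.nio.BufferOverflowException;
import java.util.Arrays;

/** Hypothetical stand-in for the do-not-retry exception type named in this report. */
class DoNotRetrySketchException extends IOException {
  DoNotRetrySketchException(String message, Throwable cause) {
    super(message, cause);
  }
}

/** Hypothetical growable buffer; not the real ByteBufferOutputStream. */
class GrowableBufferSketch {
  private final int maxSize; // assumed cap on the backing array
  private byte[] buf;
  private int pos;

  GrowableBufferSketch(int initialSize, int maxSize) {
    this.maxSize = maxSize;
    this.buf = new byte[Math.min(initialSize, maxSize)];
  }

  /** Aspect 1: check the requested size before allocating anything. */
  private void checkSizeAndGrow(int extra) {
    long needed = (long) pos + extra; // long arithmetic avoids int overflow
    if (needed > maxSize) {
      throw new BufferOverflowException();
    }
    if (needed > buf.length) {
      long doubled = Math.min((long) buf.length * 2, maxSize);
      buf = Arrays.copyOf(buf, (int) Math.max(needed, doubled));
    }
  }

  /** Aspect 2: translate the overflow into an exception that marks a retry as pointless. */
  void write(byte[] src) throws DoNotRetrySketchException {
    try {
      checkSizeAndGrow(src.length);
    } catch (BufferOverflowException e) {
      throw new DoNotRetrySketchException("response exceeds " + maxSize + " bytes", e);
    }
    System.arraycopy(src, 0, buf, pos, src.length);
    pos += src.length;
  }

  int size() {
    return pos;
  }
}
```

In this sketch the cap is a constructor argument so that the overflow path can be exercised with a few bytes instead of a multi-gigabyte allocation.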
*The Problem:* The {{DoNotRetryException}} is never propagated to the client
side. Here's the issue flow:
# {{ByteBufferOutputStream.checkSizeAndGrow()}} throws
{{BufferOverflowException}}
# {{ByteBufferOutputStream.write()}} catches it and throws
{{DoNotRetryException}}
# The exception propagates through the call stack:
** {{encoder.write()}}
** {{encodeCellsTo()}}
** {{this.cellBlockBuilder.buildCellBlockStream()}}
** {{call.setResponse()}}
# The {{DoNotRetryException}} is ultimately caught in
{{CallRunner.run()}}, where it is merely logged but not sent back to the
client
# As a result, the client continues retrying indefinitely
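The flow above comes down to where the exception is caught. The following sketch contrasts the swallowing pattern with one possible alternative that records the exception for the reply. It is an illustration under stated assumptions, not the actual {{CallRunner}} or {{ServerCall}} code and not a proposed patch; {{SketchCall}}, {{Encoder}}, and {{SketchDoNotRetryException}} are hypothetical names.

```java
import java.io.IOException;

/** Hypothetical stand-in for the do-not-retry exception type named in this report. */
class SketchDoNotRetryException extends IOException {
  SketchDoNotRetryException(String message) {
    super(message);
  }
}

/** Hypothetical response holder; the real ServerCall carries far more state. */
class SketchCall {
  /** Hypothetical hook standing in for the cell-block encoding step. */
  interface Encoder {
    String encode() throws IOException;
  }

  String payload;  // encoded result, set only if encoding succeeded
  Throwable error; // error to report to the client, if any

  /** Pattern described in the report: the failure is logged and nothing is recorded. */
  void setResponseSwallowing(Encoder encoder) {
    try {
      payload = encoder.encode();
    } catch (IOException e) {
      System.err.println("encoding failed: " + e); // logged only; the client never sees it
    }
  }

  /** One possible alternative: keep the exception so it can be serialized into the reply. */
  void setResponsePropagating(Encoder encoder) {
    try {
      payload = encoder.encode();
    } catch (IOException e) {
      payload = null;
      error = e; // the exception type is preserved for the client to inspect
    }
  }
}
```

With the first method, a caller that inspects {{error}} after a failed encode finds {{null}}; with the second, it finds the do-not-retry exception and can include it in the response.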
*Current Status:* In the latest branches (3.0 and 2.6), this issue still
exists. In {{ServerCall.java}}, when {{ALLOCATOR_POOL_ENABLED_KEY}}
({{hbase.ipc.server.reservoir.enabled}}) is set to {{false}}, the
{{setResponse()}} method follows the same problematic path. If a
{{DoNotRetryException}} is thrown in the {{encodeCellsTo()}} method, it gets
swallowed in the {{setResponse()}} catch block and never reaches the client.
*Steps to Reproduce:*
# Set up a 3-node HBase cluster with 3 RegionServers
# Set {{hbase.ipc.server.reservoir.enabled}} to {{false}}
# Inject a {{BufferOverflowException}} at
{{ByteBufferOutputStream.checkSizeAndGrow()}} to simulate an oversized
allocation request
# Send a scan request
# Observe cascading RegionServer failures due to endless client retries
*Expected Behavior:* The {{DoNotRetryException}} should be properly propagated
to the client to prevent retry attempts.
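To make the expected behavior concrete, here is a minimal client-side sketch. It assumes a simplified retry loop and is not the HBase client's retry logic; {{RetrySketch}}, {{Rpc}}, and {{ClientDoNotRetrySketchException}} are hypothetical names. The point is only that the loop can stop early when the do-not-retry type actually arrives, and cannot when it sees a generic failure.

```java
import java.io.IOException;

/** Hypothetical stand-in for the do-not-retry exception type named in this report. */
class ClientDoNotRetrySketchException extends IOException {
  ClientDoNotRetrySketchException(String message) {
    super(message);
  }
}

/** Hypothetical, simplified retry loop; the real client is considerably more involved. */
class RetrySketch {
  /** Hypothetical RPC abstraction. */
  interface Rpc<T> {
    T call() throws IOException;
  }

  static <T> T callWithRetries(Rpc<T> rpc, int maxAttempts) throws IOException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return rpc.call();
      } catch (ClientDoNotRetrySketchException e) {
        throw e; // retrying cannot help, so give up at once
      } catch (IOException e) {
        last = e; // presumed transient, so try again
      }
    }
    throw last; // all attempts failed with retriable errors
  }
}
```

If the server reports only a generic failure, this loop uses every attempt, and each attempt repeats the same oversized request against the server.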
*Actual Behavior:* The exception is caught and logged but not sent to the
client, resulting in continuous retries and cascading RegionServer failures.
*Impact:* This bug can cause cluster-wide outages when a single large request
triggers the issue, as client retries can overwhelm and crash multiple
RegionServers in sequence.
was:
I recently discovered that the fix for HBASE-14598 does not completely resolve
the issue. That fix addressed two aspects: first, when the Scan/Get RPC
attempts to allocate a very large array that could potentially lead to an
out-of-memory (OOM) error, it will check the size of the array before
allocation and directly throw an exception to prevent the region server from
crashing and avoid possible cascading failures. Second, the developer intends
for the client to stop retrying after such a failure, as retrying will not
resolve the issue.
However, their fix involved throwing a DoNotRetryException. After
ByteBufferOutputStream.write throws the DoNotRetryException, in the call stack
(ByteBufferOutputStream.write --> encoder.write --> encodeCellsTo -->
this.cellBlockBuilder.buildCellBlockStream --> call.setResponse), the
DoNotRetryException is ultimately caught in the CallRunner.run function, with
only a log printed. Consequently, the DoNotRetryException is not sent back to
the client side. Instead, the client receives a generic exception for the
failed RPC request and continues retrying, which is not the desired behavior. I
have reproduced this on the cluster.
In the code of CallRunner, it is obvious that the DoNotRetryException in
call.setResponse will be swallowed in the error handler with just a LOG printed.
> Client Does not Stop Retrying after DoNotRetryException
> -------------------------------------------------------
>
> Key: HBASE-28589
> URL: https://issues.apache.org/jira/browse/HBASE-28589
> Project: HBase
> Issue Type: Bug
> Components: IPC/RPC
> Affects Versions: 2.0.0, 2.4.0, 2.5.0, 2.6.0, 3.0.0
> Reporter: ZhenyuLi
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)