[ 
https://issues.apache.org/jira/browse/HBASE-28589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhenyuLi updated HBASE-28589:
-----------------------------
    Description: 

*Title: HBASE-14598 fix incomplete - DoNotRetryException not propagated to 
client, causing cascading RegionServer failures*

*Description:*

I have discovered that the fix for HBASE-14598 does not completely resolve the 
issue, and the problem persists in the latest branches (3.0 and 2.6).

*Background:* The original fix for HBASE-14598 addressed two aspects:
 # When a Scan/Get RPC attempts to allocate an excessively large array that 
could trigger an OutOfMemoryError (OOM), it checks the array size before 
allocation and throws a {{BufferOverflowException}} to prevent RegionServer 
crashes and potential cascading failures.
 # The fix intended to stop client retries for such failures by throwing a 
{{DoNotRetryException}} when a {{BufferOverflowException}} occurs, as retrying 
cannot resolve the underlying issue.
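
The two aspects above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the HBase code: {{SizeCheckSketch}}, {{DoNotRetrySketchException}}, {{checkedCapacity()}}, {{allocateForResponse()}} and the {{MAX_ARRAY_SIZE}} value are hypothetical names and assumptions chosen for the example.

```java
import java.io.IOException;
import java.nio.BufferOverflowException;

public class SizeCheckSketch {
    /** Hypothetical stand-in for the do-not-retry exception described in the report. */
    static class DoNotRetrySketchException extends IOException {
        DoNotRetrySketchException(String msg, Throwable cause) {
            super(msg, cause);
        }
    }

    // Assumed cap for illustration only; the real limit may differ.
    static final long MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    /** Aspect 1: validate the size before allocating instead of risking an OOM. */
    static int checkedCapacity(long requested) {
        if (requested < 0) {
            throw new IllegalArgumentException("negative size: " + requested);
        }
        if (requested > MAX_ARRAY_SIZE) {
            throw new BufferOverflowException();
        }
        return (int) requested;
    }

    /** Aspect 2: convert the overflow into an exception that tells the client not to retry. */
    static byte[] allocateForResponse(long requested) throws DoNotRetrySketchException {
        try {
            return new byte[checkedCapacity(requested)];
        } catch (BufferOverflowException e) {
            throw new DoNotRetrySketchException(
                "Response of " + requested + " bytes exceeds the buffer limit", e);
        }
    }
}
```

The size check runs before {{new byte[...]}}, so an oversized request fails fast without touching the heap.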

*The Problem:* The {{DoNotRetryException}} is never propagated to the client 
side. Here's the issue flow:
 # {{ByteBufferOutputStream.checkSizeAndGrow()}} throws 
{{BufferOverflowException}}
 # {{ByteBufferOutputStream.write()}} catches it and throws 
{{DoNotRetryException}}
 # The exception propagates through the call stack:
 ** {{encoder.write()}}
 ** {{encodeCellsTo()}}
 ** {{this.cellBlockBuilder.buildCellBlockStream()}}
 ** {{call.setResponse()}}
 # The {{DoNotRetryException}} is ultimately caught in {{CallRunner.run()}}, 
where it is merely logged but not sent back to the client
 # As a result, the client sees only a generic RPC failure and keeps retrying
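
The difference between swallowing the exception and propagating it can be sketched as below. This is a simplified model under stated assumptions, not the actual {{CallRunner}} or {{ServerCall}} code: the {{Call}} class, {{setErrorResponse()}}, {{runSwallowing()}} and {{runPropagating()}} are hypothetical names invented for the illustration.

```java
import java.io.IOException;

public class CallRunnerSketch {
    /** Hypothetical stand-in for the do-not-retry exception described in the report. */
    static class DoNotRetrySketchException extends IOException {
        DoNotRetrySketchException(String msg) {
            super(msg);
        }
    }

    /** Minimal model of an RPC call; 'response' is what the client would receive. */
    static class Call {
        String response; // null means nothing meaningful was sent back

        void setResponse(boolean tooLarge) throws IOException {
            if (tooLarge) {
                // Models the failure raised while building the cell block.
                throw new DoNotRetrySketchException("cell block too large");
            }
            response = "OK";
        }

        void setErrorResponse(Throwable t) {
            response = "ERROR:" + t.getClass().getSimpleName();
        }
    }

    /** Reported behavior: the exception is logged and then dropped. */
    static Call runSwallowing(boolean tooLarge) {
        Call call = new Call();
        try {
            call.setResponse(tooLarge);
        } catch (IOException e) {
            System.err.println("log only: " + e);
        }
        return call;
    }

    /** Expected behavior: the exception type is recorded in the response. */
    static Call runPropagating(boolean tooLarge) {
        Call call = new Call();
        try {
            call.setResponse(tooLarge);
        } catch (IOException e) {
            call.setErrorResponse(e);
        }
        return call;
    }
}
```

In the swallowing variant the client never learns the exception type, so it has no basis for deciding to stop retrying.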

*Current Status:* In the latest branches (3.0 and 2.6), this issue still 
exists. In {{ServerCall.java}}, when {{ALLOCATOR_POOL_ENABLED_KEY}} 
({{hbase.ipc.server.reservoir.enabled}}) is set to {{false}}, the 
{{setResponse()}} method follows the same problematic path. If a 
{{DoNotRetryException}} is thrown in the {{encodeCellsTo()}} method, it gets 
swallowed in the {{setResponse()}} catch block and never reaches the client.

*Steps to Reproduce:*
 # Set up a 3-node HBase cluster with 3 RegionServers
 # Set {{hbase.ipc.server.reservoir.enabled}} to {{false}}
 # Inject a {{BufferOverflowException}} at 
{{ByteBufferOutputStream.checkSizeAndGrow()}} to simulate a response too large 
to allocate
 # Send a scan request
 # Observe cascading RegionServer failures due to endless client retries

*Expected Behavior:* The {{DoNotRetryException}} should be properly propagated 
to the client to prevent retry attempts.
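
Why propagation matters on the client side can be illustrated with a simplified retry loop. This is only a sketch under assumptions, not the HBase client's retry logic: {{RetrySketch}}, {{callWithRetries()}} and {{DoNotRetrySketchException}} are hypothetical names.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetrySketch {
    /** Hypothetical stand-in for the do-not-retry exception described in the report. */
    static class DoNotRetrySketchException extends IOException {
        DoNotRetrySketchException(String msg) {
            super(msg);
        }
    }

    /**
     * Invokes the RPC up to maxAttempts times and returns how many attempts were made.
     * A do-not-retry exception ends the loop at once; any other failure is retried.
     */
    static int callWithRetries(Callable<String> rpc, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            try {
                rpc.call();
                return attempts; // success
            } catch (DoNotRetrySketchException e) {
                return attempts; // retrying cannot help, so give up immediately
            } catch (Exception e) {
                // Generic failure: fall through and try again.
            }
        }
        return attempts;
    }
}
```

With a generic failure the loop uses every attempt, and each attempt repeats the same oversized request against the server. With the do-not-retry type it stops after the first one.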

*Actual Behavior:* The exception is caught and logged but not sent to the 
client, resulting in continuous retries and cascading RegionServer failures.

*Impact:* This bug can cause cluster-wide outages when a single large request 
triggers the issue, as client retries can overwhelm and crash multiple 
RegionServers in sequence.

  was:
I recently discovered that the fix for HBASE-14598 does not completely resolve 
the issue. That fix addressed two aspects. First, when a Scan/Get RPC attempts 
to allocate a very large array that could lead to an out-of-memory (OOM) 
error, it checks the size of the array before allocation and throws an 
exception directly, which prevents the region server from crashing and avoids 
possible cascading failures. Second, the developer intended the client to stop 
retrying after such a failure, as retrying will not resolve the issue.

However, the fix relied on throwing a DoNotRetryException. After 
ByteBufferOutputStream.write throws the DoNotRetryException, it travels up the 
call stack (ByteBufferOutputStream.write --> encoder.write --> encodeCellsTo 
--> this.cellBlockBuilder.buildCellBlockStream --> call.setResponse) and is 
ultimately caught in CallRunner.run, which only prints a log message. 
Consequently, the DoNotRetryException is not sent back to the client side. 
Instead, the client receives a generic exception for the failed RPC request 
and continues retrying, which is not the desired behavior. I have reproduced 
this on a cluster.

In the CallRunner code, the DoNotRetryException thrown from call.setResponse 
is swallowed by the error handler, which only logs it.


> Client Does not Stop Retrying after DoNotRetryException
> -------------------------------------------------------
>
>                 Key: HBASE-28589
>                 URL: https://issues.apache.org/jira/browse/HBASE-28589
>             Project: HBase
>          Issue Type: Bug
>          Components: IPC/RPC
>    Affects Versions: 2.0.0, 2.4.0, 2.5.0, 2.6.0, 3.0.0
>            Reporter: ZhenyuLi
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)