[jira] [Updated] (HBASE-28589) HBASE-14598 fix incomplete - DoNotRetryException not propagated to client, causing cascading RegionServer failures

ZhenyuLi (Jira) Mon, 14 Jul 2025 14:14:05 -0700


     [ https://issues.apache.org/jira/browse/HBASE-28589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


ZhenyuLi updated HBASE-28589:
-----------------------------
    Description: 
I have discovered that the fix for HBASE-14598 does not completely resolve the 
issue, and the problem persists in the latest branches (3.0 and 2.6).

The original fix for HBASE-14598 addressed two aspects:
 # When a Scan/Get RPC attempts to allocate an excessively large array that
could trigger an OutOfMemoryError (OOM), the fix checks the array size before
allocation and throws a {{BufferOverflowException}} to prevent OOM.
 # The fix intended to stop client retries for such failures by throwing a
{{DoNotRetryIOException}} when a {{BufferOverflowException}} occurs, as
retrying cannot resolve the underlying issue.
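
The two aspects above can be illustrated with a minimal, self-contained sketch. It is not the HBase code: {{SizeGuardSketch}}, {{NonRetryableSketchException}} (a stand-in for {{DoNotRetryIOException}}), the method names, and the {{MAX_ARRAY_SIZE}} value are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.BufferOverflowException;

class SizeGuardSketch {
  /** Hypothetical stand-in for HBase's DoNotRetryIOException. */
  static class NonRetryableSketchException extends IOException {
    NonRetryableSketchException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  // Assumed upper bound for a Java array; the real limit used by HBase may differ.
  static final long MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

  /** Aspect 1: validate the requested capacity before allocating anything. */
  static int checkedCapacity(long currentSize, long extra) {
    long needed = currentSize + extra;
    if (needed > MAX_ARRAY_SIZE) {
      throw new BufferOverflowException();
    }
    return (int) needed;
  }

  /** Aspect 2: convert the overflow into an error that marks the call non-retryable. */
  static int encode(long currentSize, long extra) throws IOException {
    try {
      return checkedCapacity(currentSize, extra);
    } catch (BufferOverflowException e) {
      throw new NonRetryableSketchException("response too large to encode", e);
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("capacity: " + encode(1024, 1024));
    try {
      encode(Integer.MAX_VALUE, 1024);
    } catch (NonRetryableSketchException e) {
      System.out.println("not retryable: " + e.getMessage());
    }
  }
}
```

The conversion only helps if the resulting exception actually reaches the caller, which is the gap this issue describes.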

*The Problem:* The {{DoNotRetryIOException}} is never propagated to the
client side. The failure flows as follows:
 # {{ByteBufferOutputStream.checkSizeAndGrow()}} throws
{{BufferOverflowException}}
 # The exception propagates up through the call stack:
 ** {{encoder.write()}}
 ** {{encodeCellsTo()}} (catches {{BufferOverflowException}} and converts it
to {{DoNotRetryIOException}})
 ** {{this.cellBlockBuilder.buildCellBlockStream()}}
 ** {{call.setResponse()}}
 # The {{DoNotRetryIOException}} is ultimately caught in
{{CallRunner.run()}}, where it is merely logged but not sent back to the
client
 # As a result, the client continues retrying indefinitely
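
The flow above comes down to a handler that logs an encoding failure without recording any reply. The sketch below contrasts that with a handler that turns the failure into an error reply. It is a simplified model, not the {{CallRunner}} or {{ServerCall}} code: every class, field, and method name in it is hypothetical.

```java
import java.io.IOException;

class ServerCallSketch {
  /** Hypothetical stand-in for HBase's DoNotRetryIOException. */
  static class NonRetryableSketchException extends IOException {
    NonRetryableSketchException(String message) {
      super(message);
    }
  }

  /** Stays null until the handler records something to send to the client. */
  String reply;

  /** Models response encoding that fails because the cell block is too large. */
  void encodeResponse() throws IOException {
    throw new NonRetryableSketchException("cell block too large");
  }

  /** Reported behaviour: the failure is logged and no reply is recorded. */
  void runSwallowing() {
    try {
      encodeResponse();
      reply = "OK";
    } catch (IOException e) {
      System.err.println("logged only: " + e);
    }
  }

  /** Expected behaviour: the failure becomes an error reply the client can inspect. */
  void runPropagating() {
    try {
      encodeResponse();
      reply = "OK";
    } catch (IOException e) {
      boolean doNotRetry = e instanceof NonRetryableSketchException;
      reply = "ERROR doNotRetry=" + doNotRetry + " " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    ServerCallSketch swallowed = new ServerCallSketch();
    swallowed.runSwallowing();
    System.out.println("swallowing handler reply: " + swallowed.reply);

    ServerCallSketch propagated = new ServerCallSketch();
    propagated.runPropagating();
    System.out.println("propagating handler reply: " + propagated.reply);
  }
}
```

With the swallowing handler the client has nothing to distinguish this failure from a transient one, so its retry logic treats the call as retryable.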

*Current Status:* In the latest branches (3.0 and 2.6), this issue still
exists. In {{ServerCall.java}}, when {{ALLOCATOR_POOL_ENABLED_KEY}}
({{hbase.ipc.server.reservoir.enabled}}) is set to {{false}}, the
{{setResponse()}} method follows the same problematic path. If a
{{DoNotRetryIOException}} is thrown in the {{encodeCellsTo()}} method, it is
swallowed in the {{setResponse()}} catch block and never reaches the client.

*Steps to Reproduce:*
 # Set up a 3-node HBase cluster with 3 RegionServers
 # Set {{hbase.ipc.server.reservoir.enabled}} to {{false}} to use
{{ByteBufferOutputStream}}
 # Inject a {{BufferOverflowException}} at 
{{ByteBufferOutputStream.checkSizeAndGrow()}} to simulate an OOM condition
 # Send a scan request
 # Observe cascading RegionServer failures due to endless client retries

*Expected Behavior:* The {{DoNotRetryIOException}} should be propagated to
the client so that it stops retrying.
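
To show why propagation matters on the client side, here is a minimal sketch of a retry loop that stops only on success or on an explicit do-not-retry signal. It is not the HBase client: {{ClientRetrySketch}}, {{Outcome}}, and {{attemptsMade}} are hypothetical names, and the loop is bounded by {{maxAttempts}} only so that the example terminates.

```java
import java.util.function.Supplier;

class ClientRetrySketch {
  /** What the client observes for a single attempt. */
  enum Outcome { SUCCESS, NO_REPLY, DO_NOT_RETRY }

  /**
   * Issues the call until it succeeds, is marked non-retryable, or the attempt
   * budget runs out. Returns the number of attempts made.
   */
  static int attemptsMade(Supplier<Outcome> call, int maxAttempts) {
    int attempts = 0;
    while (attempts < maxAttempts) {
      attempts++;
      Outcome outcome = call.get();
      if (outcome == Outcome.SUCCESS || outcome == Outcome.DO_NOT_RETRY) {
        break;
      }
      // NO_REPLY looks like a transient failure, so the loop tries again.
    }
    return attempts;
  }

  public static void main(String[] args) {
    // Server swallows the error: the client uses its whole retry budget.
    System.out.println(attemptsMade(() -> Outcome.NO_REPLY, 10));
    // Server propagates the error: the client gives up after one attempt.
    System.out.println(attemptsMade(() -> Outcome.DO_NOT_RETRY, 10));
  }
}
```

In this model each wasted attempt repeats the same oversized request, which matches the load pattern the report attributes to the missing propagation.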

*Impact:* This bug can cause cluster-wide outages when a single large request 
triggers the issue, as client retries can overwhelm and crash multiple 
RegionServers in sequence.

  was:
I have discovered that the fix for HBASE-14598 does not completely resolve the 
issue, and the problem persists in the latest branches (3.0 and 2.6).

The original fix for HBASE-14598 addressed two aspects:
 # When a Scan/Get RPC attempts to allocate an excessively large array that
could trigger an OutOfMemoryError (OOM), the fix checks the array size before
allocation and throws a {{BufferOverflowException}} to prevent RegionServer
crashes and potential cascading failures.
 # The fix intended to stop client retries for such failures by throwing a
{{DoNotRetryIOException}} when a {{BufferOverflowException}} occurs, as
retrying cannot resolve the underlying issue.

*The Problem:* The {{DoNotRetryIOException}} is never propagated to the
client side. The failure flows as follows:
 # {{ByteBufferOutputStream.checkSizeAndGrow()}} throws
{{BufferOverflowException}}
 # The exception propagates up through the call stack:
 ** {{encoder.write()}}
 ** {{encodeCellsTo()}} (catches {{BufferOverflowException}} and converts it
to {{DoNotRetryIOException}})
 ** {{this.cellBlockBuilder.buildCellBlockStream()}}
 ** {{call.setResponse()}}
 # The {{DoNotRetryIOException}} is ultimately caught in
{{CallRunner.run()}}, where it is merely logged but not sent back to the
client
 # As a result, the client continues retrying indefinitely

*Current Status:* In the latest branches (3.0 and 2.6), this issue still
exists. In {{ServerCall.java}}, when {{ALLOCATOR_POOL_ENABLED_KEY}}
({{hbase.ipc.server.reservoir.enabled}}) is set to {{false}}, the
{{setResponse()}} method follows the same problematic path. If a
{{DoNotRetryIOException}} is thrown in the {{encodeCellsTo()}} method, it is
swallowed in the {{setResponse()}} catch block and never reaches the client.

*Steps to Reproduce:*
 # Set up a 3-node HBase cluster with 3 RegionServers
 # Set {{hbase.ipc.server.reservoir.enabled}} to {{false}} to use
{{ByteBufferOutputStream}}
 # Inject a {{BufferOverflowException}} at 
{{ByteBufferOutputStream.checkSizeAndGrow()}} to simulate an OOM condition
 # Send a scan request
 # Observe cascading RegionServer failures due to endless client retries

*Expected Behavior:* The {{DoNotRetryIOException}} should be propagated to
the client so that it stops retrying.

*Impact:* This bug can cause cluster-wide outages when a single large request 
triggers the issue, as client retries can overwhelm and crash multiple 
RegionServers in sequence.


> HBASE-14598 fix incomplete - DoNotRetryException not propagated to client, 
> causing cascading RegionServer failures
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-28589
>                 URL: https://issues.apache.org/jira/browse/HBASE-28589
>             Project: HBase
>          Issue Type: Bug
>          Components: IPC/RPC
>    Affects Versions: 2.0.0, 2.4.0, 2.5.0, 2.6.0, 3.0.0
>            Reporter: ZhenyuLi
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
