[
https://issues.apache.org/jira/browse/HBASE-28589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ZhenyuLi updated HBASE-28589:
-----------------------------
Description:
I have discovered that the fix for HBASE-14598 does not completely resolve the
issue, and the problem persists in the latest branches (3.0 and 2.6).
The original fix for HBASE-14598 addressed two aspects:
# When a Scan/Get RPC attempts to allocate an excessively large array that
could trigger an OutOfMemoryError (OOM), the fix checks the array size before
allocation and throws a {{BufferOverflowException}} to prevent OOM.
# The fix intended to stop client retries for such failures by throwing a
{{DoNotRetryIOException}} when a {{BufferOverflowException}} occurs, as
retrying cannot resolve the underlying issue.
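To make the two aspects concrete, here is a minimal self-contained sketch of the pattern, not the actual HBase code. The classes {{BoundedGrowBuffer}}, {{NonRetryableIOException}} and {{CellEncodingSketch}}, and the {{maxCapacity}} parameter, are hypothetical stand-ins for {{ByteBufferOutputStream}}, {{DoNotRetryIOException}} and the cell encoding step.

```java
import java.io.IOException;
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;

/** Illustrative growable buffer with a hard cap; not the HBase class. */
final class BoundedGrowBuffer {
  private final int maxCapacity; // hypothetical cap, chosen by the caller
  private ByteBuffer buf;

  BoundedGrowBuffer(int initialCapacity, int maxCapacity) {
    this.maxCapacity = maxCapacity;
    this.buf = ByteBuffer.allocate(initialCapacity);
  }

  /** Checks the required size first, and only then allocates a larger buffer. */
  private void checkSizeAndGrow(int extra) {
    long needed = (long) buf.position() + extra; // long avoids int overflow
    if (needed > maxCapacity) {
      throw new BufferOverflowException(); // fail before any large allocation
    }
    if (needed > buf.capacity()) {
      int newCapacity = (int) Math.min(Math.max(needed, buf.capacity() * 2L), maxCapacity);
      ByteBuffer bigger = ByteBuffer.allocate(newCapacity);
      buf.flip();
      bigger.put(buf);
      buf = bigger;
    }
  }

  void write(byte[] data) {
    checkSizeAndGrow(data.length);
    buf.put(data);
  }

  int size() {
    return buf.position();
  }
}

/** Hypothetical stand-in for DoNotRetryIOException. */
final class NonRetryableIOException extends IOException {
  NonRetryableIOException(String message, Throwable cause) {
    super(message, cause);
  }
}

final class CellEncodingSketch {
  /** Encodes all cells and returns the byte count, or fails with a non-retryable error. */
  static int encode(byte[][] cells, int maxCapacity) throws IOException {
    BoundedGrowBuffer out = new BoundedGrowBuffer(16, maxCapacity);
    try {
      for (byte[] cell : cells) {
        out.write(cell);
      }
    } catch (BufferOverflowException e) {
      // Retrying the same request would hit the same limit, so mark it terminal.
      throw new NonRetryableIOException("response exceeds buffer limit", e);
    }
    return out.size();
  }
}
```

The conversion only helps if the resulting exception actually reaches the client, which is the gap this issue describes.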
*The Problem:* The {{DoNotRetryIOException}} is never propagated to the client
side. The failure flow is:
# {{ByteBufferOutputStream.checkSizeAndGrow()}} throws
{{BufferOverflowException}}
# The exception propagates through the call stack:
** {{encoder.write()}}
** {{encodeCellsTo()}} (catches {{BufferOverflowException}} and converts it
into {{DoNotRetryIOException}})
** {{this.cellBlockBuilder.buildCellBlockStream()}}
** {{call.setResponse()}}
# The {{DoNotRetryIOException}} is ultimately caught in {{CallRunner.run()}},
where it is merely logged but not sent back to the client
# As a result, the client continues retrying indefinitely
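The difference between swallowing and propagating the failure can be sketched as follows. This is a simplified illustration under assumed names, not the real {{CallRunner}} or {{ServerCall}} code: {{ResponseEncoder}}, {{RpcReply}}, {{DoNotRetrySketchException}} and {{CallHandlerSketch}} are all hypothetical.

```java
import java.io.IOException;
import java.util.Optional;

/** Hypothetical stand-in for DoNotRetryIOException. */
final class DoNotRetrySketchException extends IOException {
  DoNotRetrySketchException(String message) {
    super(message);
  }
}

/** Hypothetical hook standing in for building the response cell block. */
interface ResponseEncoder {
  byte[] encode() throws IOException;
}

/** Simplified view of what the client would receive. */
final class RpcReply {
  final boolean error;
  final boolean doNotRetry;
  final String message;

  RpcReply(boolean error, boolean doNotRetry, String message) {
    this.error = error;
    this.doNotRetry = doNotRetry;
    this.message = message;
  }
}

final class CallHandlerSketch {
  /** Mirrors the reported behaviour: the failure is logged and no reply is sent. */
  static Optional<RpcReply> runSwallowing(ResponseEncoder encoder) {
    try {
      byte[] payload = encoder.encode();
      return Optional.of(new RpcReply(false, false, payload.length + " bytes"));
    } catch (IOException e) {
      System.err.println("Call failed: " + e); // logged only; the client sees nothing
      return Optional.empty();
    }
  }

  /** One possible remedy: turn the failure into an error reply carrying a do-not-retry flag. */
  static Optional<RpcReply> runPropagating(ResponseEncoder encoder) {
    try {
      byte[] payload = encoder.encode();
      return Optional.of(new RpcReply(false, false, payload.length + " bytes"));
    } catch (IOException e) {
      boolean doNotRetry = e instanceof DoNotRetrySketchException;
      return Optional.of(new RpcReply(true, doNotRetry, e.getMessage()));
    }
  }
}
```

With the swallowing variant the client has no signal to stop, so it can only time out and retry.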
*Current Status:* In the latest branches (3.0 and 2.6), this issue still
exists. In {{ServerCall.java}}, when {{ALLOCATOR_POOL_ENABLED_KEY}}
({{hbase.ipc.server.reservoir.enabled}}) is set to {{false}}, the
{{setResponse()}} method follows the same problematic path. If a
{{DoNotRetryIOException}} is thrown in the {{encodeCellsTo()}} method, it gets
swallowed in the {{setResponse()}} catch block and never reaches the client.
*Steps to Reproduce:*
# Set up a 3-node HBase cluster with 3 RegionServers
# Set {{hbase.ipc.server.reservoir.enabled}} to {{false}} to use
{{ByteBufferOutputStream}}
# Inject a {{BufferOverflowException}} at
{{ByteBufferOutputStream.checkSizeAndGrow()}} to simulate an OOM condition
# Send a scan request
# Observe cascading RegionServer failures due to endless client retries
*Expected Behavior:* The {{DoNotRetryIOException}} should be properly
propagated to the client to prevent retry attempts.
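The client-side effect of the expected behavior can be illustrated with a small retry loop. This is a sketch under assumed names, not the HBase client's retry logic: {{TerminalIOException}}, {{RpcAttempt}} and {{RetryingCallerSketch}} are hypothetical.

```java
import java.io.IOException;

/** Hypothetical stand-in for DoNotRetryIOException on the client side. */
final class TerminalIOException extends IOException {
  TerminalIOException(String message) {
    super(message);
  }
}

/** Hypothetical single RPC attempt. */
interface RpcAttempt<T> {
  T call() throws IOException;
}

final class RetryingCallerSketch {
  /** Retries ordinary I/O failures up to maxAttempts, but stops at once on a terminal one. */
  static <T> T callWithRetries(RpcAttempt<T> attempt, int maxAttempts) throws IOException {
    IOException last = null;
    for (int i = 0; i < maxAttempts; i++) {
      try {
        return attempt.call();
      } catch (TerminalIOException e) {
        throw e; // retrying cannot help, so surface the error immediately
      } catch (IOException e) {
        last = e; // possibly transient; try again
      }
    }
    throw new IOException("retries exhausted after " + maxAttempts + " attempts", last);
  }
}
```

If the server never delivers the terminal exception, every failure looks like the second catch branch and the loop keeps resending the same oversized request.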
*Impact:* This bug can cause cluster-wide outages when a single large request
triggers the issue, as client retries can overwhelm and crash multiple
RegionServers in sequence.
was:
The same text as the description above, except that item 1 of the first list
read:
# When a Scan/Get RPC attempts to allocate an excessively large array that
could trigger an OutOfMemoryError (OOM), it checks the array size before
allocation and throws a {{BufferOverflowException}} to prevent RegionServer
crashes and potential cascading failures.
> HBASE-14598 fix incomplete - DoNotRetryException not propagated to client,
> causing cascading RegionServer failures
> ------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-28589
> URL: https://issues.apache.org/jira/browse/HBASE-28589
> Project: HBase
> Issue Type: Bug
> Components: IPC/RPC
> Affects Versions: 2.0.0, 2.4.0, 2.5.0, 2.6.0, 3.0.0
> Reporter: ZhenyuLi
> Priority: Major