[
https://issues.apache.org/jira/browse/FLINK-33954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jiang Xin updated FLINK-33954:
------------------------------
Description: In some cases, the job may hang when there are not enough
buffers in the local buffer pool. For instance, with a parallelism of 4, the
HashBufferAccumulator is used. The size of the local buffer pool can be 5, and
at some point 3 of those buffers are held by 3 subpartitions and are not yet
finished, so only 2 buffers are left. If a record larger than 2 buffers
arrives, the program hangs while requesting buffers. (was: In some cases, the
job may hang when there are not enough buffers in the local buffer pool. For
instance, the parallelism is 10, so the HashBufferAccumulator is used. The size
of local buffer pool is parallelism + 1
1. The local buffer pool size can be very small when the parallelism is small.
So when a large record comes and it needs more buffers than the buffer pool
has, a hang would happen.)
> Large record may cause the hybrid shuffle hang
> ----------------------------------------------
>
> Key: FLINK-33954
> URL: https://issues.apache.org/jira/browse/FLINK-33954
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Reporter: Jiang Xin
> Priority: Major
>
> In some cases, the job may hang when there are not enough buffers in the
> local buffer pool. For instance, with a parallelism of 4, the
> HashBufferAccumulator is used. The size of the local buffer pool can be 5,
> and at some point 3 of those buffers are held by 3 subpartitions and are
> not yet finished, so only 2 buffers are left. If a record larger than 2
> buffers arrives, the program hangs while requesting buffers.
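
The arithmetic in the description can be reproduced with a small toy model. This is only a sketch under stated assumptions, not Flink code: ToyBufferPool, try_write_record, simulate and the 32 KiB BUFFER_SIZE are hypothetical names and values chosen for illustration, and the model assumes that buffers held by unfinished subpartitions are never returned while the record is being written. Where the real writer would block on a buffer request, the sketch stops and reports how many buffers it obtained.

```python
import math

# Assumed bytes per buffer; illustrative only, not read from any Flink config.
BUFFER_SIZE = 32 * 1024


class ToyBufferPool:
    """Hypothetical fixed-size pool; stands in for the local buffer pool."""

    def __init__(self, size):
        self.size = size
        self.in_use = 0

    def available(self):
        return self.size - self.in_use

    def try_request(self):
        # Non-blocking request: returns False where a blocking request
        # would wait for a buffer to be recycled.
        if self.in_use >= self.size:
            return False
        self.in_use += 1
        return True


def try_write_record(pool, record_bytes):
    """Return (buffers acquired, buffers needed) for one record."""
    needed = math.ceil(record_bytes / BUFFER_SIZE)
    acquired = 0
    while acquired < needed:
        if not pool.try_request():
            # Assumption: nothing recycles a buffer at this point, so a
            # blocking request here would wait indefinitely.
            break
        acquired += 1
    return acquired, needed


def simulate():
    """Scenario from the description: pool of 5, 3 buffers already held."""
    pool = ToyBufferPool(size=5)
    for _ in range(3):
        pool.try_request()  # one unfinished buffer per subpartition
    record_bytes = 2 * BUFFER_SIZE + 1  # just over 2 buffers
    return try_write_record(pool, record_bytes)
```

Calling simulate() returns a pair in which the acquired count is lower than the needed count, which is the condition under which the report says the job stops making progress.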