[
https://issues.apache.org/jira/browse/FLINK-33879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jiang Xin updated FLINK-33879:
------------------------------
Description:
Currently, Hybrid Shuffle can use the memory tier and the disk tier together.
However, in the following scenario the result partition stops working.
Suppose we have a shuffle task with 2 sub-partitions. The LocalBufferPool has
15 buffers, so the memory tier can use at most 15-(2*(2+1)+1) = 8 buffers
according to `TieredStorageMemoryManagerImpl#getMaxNonReclaimableBuffers`. If
the memory tier uses up all 8 buffers and the input channel doesn't consume
them because of some problem, the disk tier can still work with 1 reserved
buffer. However, if a redistribution happens at this point and the pool size is
decreased to less than 8, the BufferAccumulator cannot request buffers anymore,
and thus the result partition stops working as well.
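The arithmetic in this scenario can be sketched as follows. This is only an
illustration and not Flink code: the class and method names are hypothetical,
the reserved term 2*(n+1)+1 is copied literally from the formula above, and
the "pool size minus buffers already held" model of the LocalBufferPool is a
simplifying assumption.

```java
public class HybridShuffleBufferSketch {

    // Reserved term taken literally from the issue: 2 * (numSubpartitions + 1) + 1.
    // Assumption: the real breakdown inside Flink may be computed differently.
    static int reservedBuffers(int numSubpartitions) {
        return 2 * (numSubpartitions + 1) + 1;
    }

    // Upper bound on buffers the memory tier may hold, per the issue's formula.
    static int maxNonReclaimableBuffers(int poolSize, int numSubpartitions) {
        return poolSize - reservedBuffers(numSubpartitions);
    }

    // Simplified model: the pool can only hand out buffers while the number
    // already held is below the current pool size.
    static int requestableBuffers(int poolSize, int heldBuffers) {
        return Math.max(0, poolSize - heldBuffers);
    }

    public static void main(String[] args) {
        int held = maxNonReclaimableBuffers(15, 2);
        System.out.println("memory tier limit: " + held);
        // Before redistribution: the pool still has spare buffers.
        System.out.println("requestable at pool size 15: " + requestableBuffers(15, held));
        // After redistribution shrinks the pool below the held count.
        System.out.println("requestable at pool size 7: " + requestableBuffers(7, held));
    }
}
```

Under these assumptions, once the pool size falls below the 8 buffers the
memory tier already holds, nothing is left to request until some of those
buffers are consumed and recycled.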
was:
Currently, the Hybrid Shuffle can work with Memory Tier and Disk Tier together,
however, in the following scenario the result partition would stop working.
Suppose we have a shuffle task with 2 sub partitions. The LocalBufferPool has
15 buffers, the memory tier can use at most 15-(2*(2+1)+1) = 8 buffers. If the
memory tier used up all 8 buffers and the input channel doesn't consume them
because of some problem, the disk tier can still work with 1 reserved buffer.
However, if a redistribution happens now and the pool size is decreased to less
than 8, then the BufferAccumulator can not request buffers any more, so
> Hybrid Shuffle may hang during redistribution
> ---------------------------------------------
>
> Key: FLINK-33879
> URL: https://issues.apache.org/jira/browse/FLINK-33879
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Reporter: Jiang Xin
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.19.0