Yingjie Cao created FLINK-31386:
-----------------------------------

             Summary: Fix the potential deadlock issue of blocking shuffle
                 Key: FLINK-31386
                 URL: https://issues.apache.org/jira/browse/FLINK-31386
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Network
            Reporter: Yingjie Cao
             Fix For: 1.17.0

Currently, the SortMergeResultPartition may allocate more network buffers than the guaranteed size of the LocalBufferPool. As a result, some result partitions may have to wait for other result partitions to release the over-allocated network buffers before they can continue. However, the result partitions that have allocated more than their guaranteed buffers rely on the processing of input data to trigger data spilling and buffer recycling. Processing the input data in turn relies on the batch read buffers used by the SortMergeResultPartitionReadScheduler, which may already have been taken by the blocked result partitions that are waiting for buffers. A deadlock then occurs.

We can easily fix this deadlock by reserving the guaranteed buffers on initialization.
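
To illustrate the proposed fix, here is a minimal, self-contained sketch of "reserve the guaranteed buffers on initialization". It is a toy model, not Flink code: the classes GlobalPool and LocalPool and their methods are hypothetical stand-ins and do not mirror the real LocalBufferPool API. Buffers are modeled as plain counters, and a request that would block in a real runtime simply returns false here.

```java
final class ReserveOnInitSketch {

    /** Hypothetical stand-in for a fixed-size global network buffer pool. */
    static final class GlobalPool {
        private int available;

        GlobalPool(int total) {
            this.available = total;
        }

        /** Takes up to n buffers and returns how many were actually granted. */
        synchronized int take(int n) {
            int granted = Math.min(n, available);
            available -= granted;
            return granted;
        }

        synchronized void giveBack(int n) {
            available += n;
        }

        synchronized int available() {
            return available;
        }
    }

    /** Hypothetical local pool that reserves its guaranteed buffers eagerly. */
    static final class LocalPool {
        private final GlobalPool global;
        private final int guaranteed;
        private int free;   // reserved buffers that are currently idle
        private int inUse;  // buffers handed out to the partition

        LocalPool(GlobalPool global, int guaranteed) {
            this.global = global;
            this.guaranteed = guaranteed;
            // Reserve the guaranteed buffers up front, so that no other
            // pool can over-allocate them later.
            int granted = global.take(guaranteed);
            if (granted < guaranteed) {
                global.giveBack(granted);
                throw new IllegalStateException("Insufficient buffers to reserve");
            }
            this.free = guaranteed;
        }

        /** Returns false where a real pool would block waiting for a buffer. */
        synchronized boolean requestBuffer() {
            if (free > 0) {
                free--;
                inUse++;
                return true;
            }
            // Over-allocation: only unreserved global buffers can be taken.
            if (global.take(1) == 1) {
                inUse++;
                return true;
            }
            return false;
        }

        synchronized void recycle() {
            if (inUse == 0) {
                throw new IllegalStateException("No buffer in use");
            }
            inUse--;
            if (free + inUse < guaranteed) {
                free++;             // keep the buffer as part of the reservation
            } else {
                global.giveBack(1); // return over-allocated buffers
            }
        }
    }

    /** Counts how many of the given number of requests succeed. */
    static int requestMany(LocalPool pool, int attempts) {
        int granted = 0;
        for (int i = 0; i < attempts; i++) {
            if (pool.requestBuffer()) {
                granted++;
            }
        }
        return granted;
    }

    public static void main(String[] args) {
        GlobalPool global = new GlobalPool(10);
        LocalPool a = new LocalPool(global, 4);
        LocalPool b = new LocalPool(global, 4);

        int gotA = requestMany(a, 10); // greedy partition
        int gotB = requestMany(b, 4);  // still gets its guaranteed share

        System.out.println("A got " + gotA + ", B got " + gotB);
    }
}
```

In this model the greedy pool A can only over-allocate the two unreserved buffers, so pool B always obtains its four guaranteed buffers without waiting on A. Without the eager reservation in the constructor, A could take all ten buffers and B would have to wait for A to recycle them, which is the waiting pattern described above.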