Weijie Guo created FLINK-29298:
----------------------------------
Summary: LocalBufferPool request buffer from NetworkBufferPool
hanging
Key: FLINK-29298
URL: https://issues.apache.org/jira/browse/FLINK-29298
Project: Flink
Issue Type: Bug
Components: Runtime / Network
Affects Versions: 1.16.0
Reporter: Weijie Guo
Attachments: image-2022-09-14-10-52-15-259.png,
image-2022-09-14-10-58-45-987.png, image-2022-09-14-11-00-47-309.png
When buffer contention is fierce, a task can occasionally be observed to hang.
The thread dump shows that the task thread is blocked in
requestMemorySegmentBlocking forever. After investigating the heap dump, I
found that the NetworkBufferPool actually has many buffers available, but the
LocalBufferPool is still unavailable and has not obtained any buffer.
Looking at the code, I am sure that this is a race condition: when the task
thread polls the last buffer from the LocalBufferPool and triggers the
onGlobalPoolAvailable callback itself, the callback skips this notification
(because the LocalBufferPool is still marked available at that moment). As a
result, the LocalBufferPool eventually becomes unavailable and never registers
a callback with the NetworkBufferPool.
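
To make the interleaving concrete, here is a minimal single-threaded sketch of
the lost notification. It is a toy model, not Flink's code: ToyGlobalPool,
ToyLocalPool, the interleaving hook and all field names are hypothetical, and
the other thread's recycle is simulated by running a hook between two steps.
It relies only on the fact that CompletableFuture.thenRun runs the action in
the calling thread when the future is already complete.

```java
import java.util.ArrayDeque;
import java.util.concurrent.CompletableFuture;

public class LostCallbackSketch {

    /** Toy stand-in for the global pool (hypothetical, not NetworkBufferPool). */
    static class ToyGlobalPool {
        int free = 0;
        private CompletableFuture<Void> availableFuture = new CompletableFuture<>();

        boolean tryTake() {
            if (free == 0) {
                return false;
            }
            free--;
            if (free == 0) {
                // Nothing left: hand out a fresh, incomplete future.
                availableFuture = new CompletableFuture<>();
            }
            return true;
        }

        void recycle() {
            free++;
            availableFuture.complete(null); // fires callbacks registered so far
        }

        CompletableFuture<Void> getAvailableFuture() {
            return availableFuture;
        }
    }

    /** Toy stand-in for the local pool (hypothetical, not LocalBufferPool). */
    static class ToyLocalPool {
        final ToyGlobalPool global;
        final ArrayDeque<Integer> buffers = new ArrayDeque<>();
        boolean available = true;
        boolean callbackRegistered = false;
        // Simulates what another thread does between two steps of requestBuffer().
        Runnable interleaving = () -> {};

        ToyLocalPool(ToyGlobalPool global, int initialBuffers) {
            this.global = global;
            for (int i = 0; i < initialBuffers; i++) {
                buffers.add(i);
            }
        }

        Integer requestBuffer() {
            Integer buffer = buffers.poll();
            if (buffers.isEmpty()) {
                if (global.tryTake()) {
                    buffers.add(0);
                } else {
                    interleaving.run();  // e.g. another thread recycles a buffer
                    registerCallback();  // may run the callback right here
                }
                if (buffers.isEmpty()) {
                    available = false;   // flipped only after the callback ran
                }
            }
            return buffer;
        }

        void registerCallback() {
            if (callbackRegistered) {
                return;
            }
            callbackRegistered = true;
            // If the future is already complete, this runs synchronously.
            global.getAvailableFuture().thenRun(this::onGlobalPoolAvailable);
        }

        void onGlobalPoolAvailable() {
            callbackRegistered = false;
            if (available) {
                return; // skipped: the pool still looks available
            }
            if (global.tryTake()) {
                buffers.add(0);
                available = true;
            } else {
                registerCallback();
            }
        }
    }

    public static void main(String[] args) {
        ToyGlobalPool global = new ToyGlobalPool();
        ToyLocalPool local = new ToyLocalPool(global, 1);
        local.interleaving = global::recycle; // the racy interleaving
        local.requestBuffer();                // takes the last local buffer
        System.out.println("available=" + local.available
                + " callbackRegistered=" + local.callbackRegistered
                + " globalFree=" + global.free);
    }
}
```

In this model the run ends with the local pool unavailable, no callback
registered, and one free buffer in the global pool, which mirrors the state
seen in the heap dump. Without the interleaving hook, the callback is attached
to an incomplete future and a later recycle restores availability as expected.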
The conditions that trigger the problem are relatively strict, but I have
found a stable way to reproduce it. I will try to fix and verify this problem.
!image-2022-09-14-10-52-15-259.png|width=1021,height=219!
!image-2022-09-14-10-58-45-987.png|width=997,height=315!
!image-2022-09-14-11-00-47-309.png|width=453,height=121!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)