[ https://issues.apache.org/jira/browse/FLINK-29298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702051#comment-17702051 ]
Weijie Guo edited comment on FLINK-29298 at 3/18/23 3:27 AM:
-------------------------------------------------------------

Hi [~lichen1109]. Unfortunately, this bug can only be reproduced under strong buffer competition. However, I wrote a unit test for this PR that reproduces the problem with high probability. In addition, there is another bug (FLINK-31293) that can cause a similar symptom. Is your job a batch job or a streaming job? There should be no similar problems with batch jobs on the latest release-1.17 and master branches.

> LocalBufferPool request buffer from NetworkBufferPool hanging
> -------------------------------------------------------------
>
>                 Key: FLINK-29298
>                 URL: https://issues.apache.org/jira/browse/FLINK-29298
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.16.0
>            Reporter: Weijie Guo
>            Assignee: Weijie Guo
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.17.0, 1.16.1
>
>         Attachments: image-2022-09-14-10-52-15-259.png, image-2022-09-14-10-58-45-987.png, image-2022-09-14-11-00-47-309.png
>
>
> In scenarios where buffer contention is fierce, the task can sometimes be observed to hang. The thread dump shows that the task thread is blocked in requestMemorySegmentBlocking forever. After investigating the heap dump, I found that the NetworkBufferPool actually still has many buffers, yet the LocalBufferPool remains unavailable and never obtains one.
>
> By reading the code, I am confident this is a thread-race bug: when the task thread polls out the last buffer in the LocalBufferPool and thereby triggers the onGlobalPoolAvailable callback itself, the callback skips the notification (because at that moment the LocalBufferPool still looks available). The pool then becomes unavailable without ever re-registering a callback with the NetworkBufferPool, so it is never woken up again.
>
> The conditions for triggering the problem are strict, but I have found a stable way to reproduce it, and I will try to fix and verify it.
>
> !image-2022-09-14-10-52-15-259.png|width=1021,height=219!
> !image-2022-09-14-10-58-45-987.png|width=997,height=315!
> !image-2022-09-14-11-00-47-309.png|width=453,height=121!
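To make the interleaving easier to follow, below is a minimal single-threaded sketch of the lost-notification pattern described in the report. It is not the actual Flink code: GlobalPool and LocalPool are simplified stand-ins for NetworkBufferPool and LocalBufferPool, and the direct notifyAvailable() call models the path where the request itself synchronously fires the registered callback on the task thread.

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

public class LostNotificationSketch {

    // Simplified stand-in for NetworkBufferPool: holds one one-shot callback.
    static class GlobalPool {
        private Runnable availabilityListener;

        void registerListener(Runnable listener) {
            availabilityListener = listener;
        }

        // Fired when global buffers become available; the callback is
        // consumed and must be registered again to receive the next event.
        void notifyAvailable() {
            Runnable listener = availabilityListener;
            availabilityListener = null;
            if (listener != null) {
                listener.run();
            }
        }

        boolean hasListener() {
            return availabilityListener != null;
        }
    }

    // Simplified stand-in for LocalBufferPool, containing the buggy callback.
    static class LocalPool {
        private final Queue<Object> buffers = new ArrayDeque<>();
        private final GlobalPool global;
        private boolean available = true;

        LocalPool(GlobalPool global) {
            this.global = global;
            buffers.add(new Object());                    // one cached buffer
            global.registerListener(this::onGlobalPoolAvailable);
        }

        // BUGGY: while the pool still looks available, the notification is
        // dropped and the callback is NOT re-registered.
        void onGlobalPoolAvailable() {
            if (available) {
                return;
            }
            // ...would otherwise fetch buffers from the global pool...
        }

        Object requestBuffer() {
            Object buffer = buffers.poll();   // (1) poll out the LAST buffer
            global.notifyAvailable();         // (2) the same thread triggers the
                                              //     callback; 'available' is still
                                              //     true, so the event is dropped
            if (buffers.isEmpty()) {
                available = false;            // (3) now unavailable, but the
                                              //     one-shot callback is gone
            }
            return buffer;
        }
    }

    public static void main(String[] args) {
        GlobalPool global = new GlobalPool();
        LocalPool local = new LocalPool(global);

        local.requestBuffer();

        // Prints "available=false, listenerRegistered=false": every further
        // blocking request would now wait forever, matching the observed hang.
        System.out.println("available=" + local.available
                + ", listenerRegistered=" + global.hasListener());
    }
}
{code}

The actual fix in the linked PR is more involved; this sketch only shows why a one-shot availability callback checked against a stale "available" state can be lost for good.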