[ https://issues.apache.org/jira/browse/FLINK-29298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702051#comment-17702051 ]
Weijie Guo edited comment on FLINK-29298 at 3/18/23 3:27 AM:
-------------------------------------------------------------

Hi [~lichen1109]. Unfortunately, this bug can only be reproduced under strong buffer competition. However, I wrote a unit test for this PR that reproduces the problem with high probability. In addition, there is another bug (FLINK-31293) that can cause a similar symptom. Is your job a batch job or a streaming job? There should be no similar problems with batch jobs on the latest release-1.17 and master branches.

> LocalBufferPool request buffer from NetworkBufferPool hanging
> -------------------------------------------------------------
>
>                 Key: FLINK-29298
>                 URL: https://issues.apache.org/jira/browse/FLINK-29298
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.16.0
>            Reporter: Weijie Guo
>            Assignee: Weijie Guo
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.17.0, 1.16.1
>
>         Attachments: image-2022-09-14-10-52-15-259.png, image-2022-09-14-10-58-45-987.png, image-2022-09-14-11-00-47-309.png
>
>
> In scenarios where buffer contention is fierce, the task can sometimes be observed to hang. The thread dump shows that the task thread is blocked in requestMemorySegmentBlocking forever. After investigating the heap dump, I found that the NetworkBufferPool actually still has many buffers, yet the LocalBufferPool remains unavailable and never obtains one.
>
> By reading the code, I am confident this is a thread-race bug: when the task thread polls out the last buffer in the LocalBufferPool and thereby triggers the onGlobalPoolAvailable callback itself, the callback skips the notification (because at that moment the LocalBufferPool still looks available). The pool then becomes unavailable without ever re-registering a callback with the NetworkBufferPool, so it is never woken up again.
>
> The conditions for triggering the problem are strict, but I have found a stable way to reproduce it, and I will try to fix and verify it.
>
> !image-2022-09-14-10-52-15-259.png|width=1021,height=219!
> !image-2022-09-14-10-58-45-987.png|width=997,height=315!
> !image-2022-09-14-11-00-47-309.png|width=453,height=121!
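To make the interleaving easier to follow, below is a minimal single-threaded sketch of the lost-notification pattern described in the report. It is not the actual Flink code: GlobalPool and LocalPool are simplified stand-ins for NetworkBufferPool and LocalBufferPool, and the direct notifyAvailable() call models the path where the request itself synchronously fires the registered callback on the task thread.

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

public class LostNotificationSketch {

    // Simplified stand-in for NetworkBufferPool: holds one one-shot callback.
    static class GlobalPool {
        private Runnable availabilityListener;

        void registerListener(Runnable listener) {
            availabilityListener = listener;
        }

        // Fired when global buffers become available; the callback is
        // consumed and must be registered again to receive the next event.
        void notifyAvailable() {
            Runnable listener = availabilityListener;
            availabilityListener = null;
            if (listener != null) {
                listener.run();
            }
        }

        boolean hasListener() {
            return availabilityListener != null;
        }
    }

    // Simplified stand-in for LocalBufferPool, containing the buggy callback.
    static class LocalPool {
        private final Queue<Object> buffers = new ArrayDeque<>();
        private final GlobalPool global;
        private boolean available = true;

        LocalPool(GlobalPool global) {
            this.global = global;
            buffers.add(new Object());                    // one cached buffer
            global.registerListener(this::onGlobalPoolAvailable);
        }

        // BUGGY: while the pool still looks available, the notification is
        // dropped and the callback is NOT re-registered.
        void onGlobalPoolAvailable() {
            if (available) {
                return;
            }
            // ...would otherwise fetch buffers from the global pool...
        }

        Object requestBuffer() {
            Object buffer = buffers.poll();   // (1) poll out the LAST buffer
            global.notifyAvailable();         // (2) the same thread triggers the
                                              //     callback; 'available' is still
                                              //     true, so the event is dropped
            if (buffers.isEmpty()) {
                available = false;            // (3) now unavailable, but the
                                              //     one-shot callback is gone
            }
            return buffer;
        }
    }

    public static void main(String[] args) {
        GlobalPool global = new GlobalPool();
        LocalPool local = new LocalPool(global);

        local.requestBuffer();

        // Prints "available=false, listenerRegistered=false": every further
        // blocking request would now wait forever, matching the observed hang.
        System.out.println("available=" + local.available
                + ", listenerRegistered=" + global.hasListener());
    }
}
{code}

The actual fix in the linked PR is more involved; this sketch only shows why a one-shot availability callback checked against a stale "available" state can be lost for good.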