[jira] [Commented] (FLINK-21992) Investigate potential buffer leak in unaligned checkpoint

Arvid Heise (Jira) Tue, 13 Apr 2021 00:29:05 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-21992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319986#comment-17319986
 ]


Arvid Heise commented on FLINK-21992:
-------------------------------------

It turns out that there is an issue with notification. We managed to reliably 
reproduce it with:
* Unaligned checkpoints with
* Unions going into
* Two input tasks.

The root cause is a bug in {{UnionInputGate}} introduced in FLINK-19026. The 
availability notification of {{UnionInputGate}} is simply reset too early, 
leading to stuck tasks.

The bug can probably also be triggered with single input tasks, but certain 
factors mask it: if you drain a union gate entirely without checking 
availability after the first buffer, the bug is not visible. Since we hot-loop 
in plenty of places until running out of data, it may be that only the 
combination of the three things above actually makes it visible.
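
To illustrate the failure mode, here is a toy model in plain Java. It is not 
the actual {{UnionInputGate}} code: the class {{ToyUnionGate}}, the boolean 
{{available}} flag (standing in for the availability future) and the 
{{resetTooEarly}} switch are all made up for this sketch. It only shows why a 
consumer that re-checks availability after each buffer gets stuck when the flag 
is reset too early, while a hot loop that drains until it runs out of data 
never notices.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model only; names and structure are hypothetical, not Flink's implementation. */
final class ToyUnionGate {
    private final Queue<String> buffers = new ArrayDeque<>();
    private final boolean resetTooEarly;
    private boolean available; // stands in for the gate's availability notification

    ToyUnionGate(boolean resetTooEarly) {
        this.resetTooEarly = resetTooEarly;
    }

    void enqueue(String buffer) {
        boolean wasEmpty = buffers.isEmpty();
        buffers.add(buffer);
        if (wasEmpty) {
            available = true; // notify only on the empty -> non-empty transition
        }
    }

    String poll() {
        String next = buffers.poll();
        if (resetTooEarly) {
            available = false; // modeled bug: reset although data may still be queued
        } else if (buffers.isEmpty()) {
            available = false; // intended: reset only once nothing is left
        }
        return next;
    }

    boolean isAvailable() {
        return available;
    }

    /** Consumer that looks at availability before every poll. */
    static int consumeCheckingAvailability(ToyUnionGate gate) {
        int consumed = 0;
        while (gate.isAvailable()) {
            if (gate.poll() != null) {
                consumed++;
            }
        }
        return consumed;
    }

    /** Hot loop that drains until poll() returns null, ignoring availability. */
    static int drain(ToyUnionGate gate) {
        int consumed = 0;
        while (gate.poll() != null) {
            consumed++;
        }
        return consumed;
    }

    static ToyUnionGate withThreeBuffers(boolean resetTooEarly) {
        ToyUnionGate gate = new ToyUnionGate(resetTooEarly);
        gate.enqueue("a");
        gate.enqueue("b");
        gate.enqueue("c");
        return gate;
    }

    public static void main(String[] args) {
        // Early reset + availability check: stops after the first buffer.
        System.out.println(consumeCheckingAvailability(withThreeBuffers(true)));  // 1
        // Correct reset + availability check: consumes everything.
        System.out.println(consumeCheckingAvailability(withThreeBuffers(false))); // 3
        // Early reset + hot loop: the bug is masked.
        System.out.println(drain(withThreeBuffers(true)));                        // 3
    }
}
```

In the first case two buffers stay queued and nobody is notified again, which 
is the toy equivalent of a stuck task.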

> Investigate potential buffer leak in unaligned checkpoint
> ---------------------------------------------------------
>
>                 Key: FLINK-21992
>                 URL: https://issues.apache.org/jira/browse/FLINK-21992
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.2, 1.13.0
>            Reporter: Arvid Heise
>            Assignee: Piotr Nowojski
>            Priority: Blocker
>
> A user on the mailing list reported that his job gets stuck with unaligned 
> checkpoints enabled.
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Source-Operators-Stuck-in-the-requestBufferBuilderBlocking-tt42530.html
> We received two similar reports in the past, but the users didn't follow up, 
> so it was not as easy to diagnose as this time where the initial report 
> already contains many relevant data points. 
> Besides a buffer leak, there could also be an issue with priority notification.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
