[
https://issues.apache.org/jira/browse/FLINK-21992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319986#comment-17319986
]
Arvid Heise commented on FLINK-21992:
-------------------------------------
It turns out that there is an issue with notification. We managed to reliably
reproduce it with:
* Unaligned checkpoints with
* Unions going into
* Two input tasks.
The root cause is a bug in {{UnionInputGate}} introduced in FLINK-19026. The
availability notification of {{UnionInputGate}} is reset too early, which
leaves tasks stuck.
The bug can probably also be triggered with single-input tasks, but certain
factors mask it: if you drain a union gate entirely without checking
availability after the first buffer, the bug is not visible. Since we hot-loop
in plenty of places until we run out of data, it may be that only the
combination of the three things above actually makes it visible.
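To illustrate why a draining consumer can hide an availability flag that is
reset too early, here is a minimal toy sketch. It is not the actual
{{UnionInputGate}} code: {{ToyUnionGate}}, {{resetTooEarly}} and both consumer
methods are hypothetical names made up for this example, and the real gate is
asynchronous and considerably more involved.

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

public class ToyUnionGateDemo {

    /** Hypothetical, heavily simplified stand-in for a union gate (not Flink code). */
    static final class ToyUnionGate {
        private final Queue<String> buffers = new ArrayDeque<>();
        private final boolean resetTooEarly;
        private boolean available;

        ToyUnionGate(boolean resetTooEarly) {
            this.resetTooEarly = resetTooEarly;
        }

        void add(String buffer) {
            buffers.add(buffer);
            available = true;
        }

        boolean isAvailable() {
            return available;
        }

        Optional<String> pollNext() {
            String next = buffers.poll();
            if (resetTooEarly) {
                // Assumed buggy variant: the flag is cleared on every poll,
                // even though more buffers may still be queued.
                available = false;
            } else {
                // Intended behaviour: stay available while data remains.
                available = !buffers.isEmpty();
            }
            return Optional.ofNullable(next);
        }
    }

    /** Consumer that re-checks availability before every poll. */
    static int consumeCheckingAvailability(ToyUnionGate gate) {
        int consumed = 0;
        while (gate.isAvailable()) {
            if (gate.pollNext().isPresent()) {
                consumed++;
            }
        }
        return consumed;
    }

    /** Consumer that hot-loops until the gate returns nothing. */
    static int consumeByDraining(ToyUnionGate gate) {
        int consumed = 0;
        while (gate.pollNext().isPresent()) {
            consumed++;
        }
        return consumed;
    }

    static ToyUnionGate gateWithThreeBuffers(boolean resetTooEarly) {
        ToyUnionGate gate = new ToyUnionGate(resetTooEarly);
        gate.add("a");
        gate.add("b");
        gate.add("c");
        return gate;
    }

    public static void main(String[] args) {
        // Buggy flag + availability check: stops after one buffer, two stay queued.
        System.out.println(consumeCheckingAvailability(gateWithThreeBuffers(true)));  // prints 1
        // Correct flag + availability check: all three buffers are consumed.
        System.out.println(consumeCheckingAvailability(gateWithThreeBuffers(false))); // prints 3
        // Buggy flag + draining: the flag is never consulted, so the bug is masked.
        System.out.println(consumeByDraining(gateWithThreeBuffers(true)));            // prints 3
    }
}
```

In this toy model, only the consumer that consults the flag between buffers
gets stuck. That matches the observation that hot-looping until data runs out
can mask the problem.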
> Investigate potential buffer leak in unaligned checkpoint
> ---------------------------------------------------------
>
> Key: FLINK-21992
> URL: https://issues.apache.org/jira/browse/FLINK-21992
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.12.2, 1.13.0
> Reporter: Arvid Heise
> Assignee: Piotr Nowojski
> Priority: Blocker
>
> A user on the mailing list reported that their job gets stuck with unaligned
> checkpoints enabled.
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Source-Operators-Stuck-in-the-requestBufferBuilderBlocking-tt42530.html
> We received two similar reports in the past, but the users didn't follow up,
> so it was not as easy to diagnose as this time where the initial report
> already contains many relevant data points.
> Besides a buffer leak, there could also be an issue with priority notification.