[
https://issues.apache.org/jira/browse/FLINK-23041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yun Tang updated FLINK-23041:
-----------------------------
Component/s: Runtime / Checkpointing
> Change local alignment timeout back to the global time out
> ----------------------------------------------------------
>
> Key: FLINK-23041
> URL: https://issues.apache.org/jira/browse/FLINK-23041
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Reporter: Anton Kalashnikov
> Assignee: Anton Kalashnikov
> Priority: Major
> Labels: pull-request-available
>
> Local alignment timeouts are very confusing and especially without timeout on
> the outputs, they can significantly delay timeouting to UC.
> Problematic case is when all CBs are received with long delay because of the
> back pressure, but they arrive at the same time. Alignment time can be low
> (milliseconds), while start delay is ~1 minute. In that case checkpoint
> doesn't timeout to UC and is passing the responsibility to timeout down the
> stream.
>
> So it is not so transparant for the user why and when AC switches to UC. As
> mentioned before, the start delay is not correlated with the alignment
> timeout because it doesn't take into account time in output buffer. the
> alignment time is not fully correlated with the alignment timeout because the
> alignment time doesn't take into account the barrier announcement.
>
> Based on this, there is the proposal to change the semantic of
> alignmentTimeout configuration to such meaning:
> *The time between the starting of checkpoint(on the checkpont coordinator)
> and the time when the checkpoint barrier will be received by task.*
> By this definition, we will have kind of global timeout which says that if
> the AC isn't finished for alignmentTimeout time it will be switched to UC.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)