[ 
https://issues.apache.org/jira/browse/FLINK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327850#comment-17327850
 ] 

Flink Jira Bot commented on FLINK-17531:
----------------------------------------

This major issue is unassigned, and neither it nor any of its Sub-Tasks have 
been updated for 30 days. It has therefore been labeled "stale-major". If this 
ticket is indeed "major", please either assign yourself or give an update. 
Afterwards, please remove the label. In 7 days the issue will be deprioritized.

> Add a new checkpoint Gauge metric: elapsedTimeSinceLastCompletedCheckpoint
> --------------------------------------------------------------------------
>
>                 Key: FLINK-17531
>                 URL: https://issues.apache.org/jira/browse/FLINK-17531
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.10.0
>            Reporter: Steven Zhen Wu
>            Priority: Major
>              Labels: stale-major
>
> I'd like to discuss the value of a new checkpoint Gauge metric: 
> `elapsedTimeSinceLastCompletedCheckpoint`. The main motivation is alerting. I 
> know the reasons below are somewhat specific to our setup, hence I want to 
> explore the interest of the community.
>  
> *What do we want to achieve?*
> We want to alert if no successful checkpoint has happened for a specific 
> period. With this new metric, we can set up a simple alerting rule like `alert 
> if elapsedTimeSinceLastCompletedCheckpoint > N minutes`. This follows the good 
> alerting pattern of `time since last success`, and we found an 
> `elapsedTimeSinceLastCompletedCheckpoint` metric very easy and intuitive to 
> set up alerts against.
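> 
> As an illustration, below is a minimal sketch (not Flink's actual internals; 
> the class name and the completion callback are hypothetical) of how such a 
> gauge could be registered through Flink's metrics API:
> 
>     import org.apache.flink.metrics.Gauge;
>     import org.apache.flink.metrics.MetricGroup;
> 
>     public class CheckpointAgeGauge {
>         // Timestamp of the most recent successfully completed checkpoint.
>         private volatile long lastCompletedCheckpointTimestamp = System.currentTimeMillis();
> 
>         public void register(MetricGroup checkpointMetricGroup) {
>             // Gauge value: milliseconds elapsed since the last completed checkpoint.
>             Gauge<Long> elapsed =
>                 () -> System.currentTimeMillis() - lastCompletedCheckpointTimestamp;
>             checkpointMetricGroup.gauge("elapsedTimeSinceLastCompletedCheckpoint", elapsed);
>         }
> 
>         // Hypothetical hook: would be invoked whenever a checkpoint completes.
>         public void onCheckpointCompleted(long completionTimestamp) {
>             lastCompletedCheckpointTimestamp = completionTimestamp;
>         }
>     }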
>  
> *What about existing checkpoint metrics?*
> `numberOfCompletedCheckpoints`. We can set up an alert like `alert if 
> derivative(numberOfCompletedCheckpoints) == 0 for N minutes`. However, this is 
> an anti-pattern for our alerting system, as it looks for the absence of a good 
> signal rather than an explicit bad signal. Such an anti-pattern is more prone 
> to false alarms when metrics are occasionally dropped or the alerting system 
> has a processing issue.
>  
> `numberOfFailedCheckpoints`. This is an explicit failure signal, which is 
> good. We can set up an alert like `alert if 
> derivative(numberOfFailedCheckpoints) > 0 in X out of Y minutes`. However, we 
> have some high-parallelism, large-state jobs whose normal checkpoint duration 
> is under 1-2 minutes. When recovering from an outage with a large backlog, 
> subtasks on one or a few containers sometimes experience very high back 
> pressure, and it can take the checkpoint barrier more than an hour to travel 
> through the DAG to those heavily back-pressured subtasks. The back pressure is 
> likely caused by the multi-tenant environment and by performance variation 
> among containers. Instead of letting checkpoints time out in this case, we 
> decided to increase the checkpoint timeout to a very large value (like 2 
> hours), as sketched below. With that, we effectively lose the explicit "bad" 
> signal of a failed or timed-out checkpoint.
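> 
> For reference, a minimal sketch of that workaround using Flink's public 
> checkpoint configuration API (the one-minute interval and two-hour timeout are 
> just our values, not a recommendation):
> 
>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> 
>     public class LongCheckpointTimeoutExample {
>         public static void main(String[] args) {
>             StreamExecutionEnvironment env =
>                 StreamExecutionEnvironment.getExecutionEnvironment();
>             // Trigger a checkpoint every minute.
>             env.enableCheckpointing(60_000L);
>             // Allow a checkpoint up to 2 hours to complete before it times out.
>             env.getCheckpointConfig().setCheckpointTimeout(2L * 60 * 60 * 1000);
>             // ... job topology and env.execute(...) would follow here.
>         }
>     }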
>  
> In theory, one could argue that we could set the checkpoint timeout to 
> infinity. A long but completed checkpoint is always better than a timed-out 
> one, because a timed-out checkpoint essentially gives up its position in the 
> queue and the new checkpoint starts again from the back of the queue. Note 
> that we are using at-least-once checkpoint semantics, so there is no barrier 
> alignment concern. FLIP-76 (unaligned checkpoints) can help checkpointing deal 
> with back pressure better, but it is not ready yet and also has its 
> limitations; that is a separate discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
