Steven Zhen Wu created FLINK-17531:
--------------------------------------

             Summary: Add a new checkpoint Guage metric: 
elapsedSecondsSinceLastCompletedCheckpoint
                 Key: FLINK-17531
                 URL: https://issues.apache.org/jira/browse/FLINK-17531
             Project: Flink
          Issue Type: New Feature
          Components: Runtime / Checkpointing
    Affects Versions: 1.10.0
            Reporter: Steven Zhen Wu


like to discuss the value of a new checkpoint Guage metric: 
`elapsedSecondsSinceLastCompletedCheckpoint`. Main motivation is for alerting. 
I know reasons below are somewhat related to our setup. Hence want to explore 
the interest of the community.

*What do we want to achieve?*

We want to alert if no successful checkpoint happened for a specific period. 
With this new metric, we can set up a simple alerting rule like `alert if 
elapsedSecondsSinceLastCompletedCheckpoint > N minutes`. It is a good alerting 
pattern of `time since last success`.

*What out existing checkpoint metrics?*

* `numberOfCompletedCheckpoints`. We can set up an alert like `alert if 
numberOfCompletedCheckpoints = 0 for N minutes`. However, it is an anti-pattern 
for our alerting system, as it is looking for lack of good signal (vs explicit 
bad signal). Such an anti-pattern is easier to suffer false alarm problem when 
there is occasional metric drop or alerting system processing issue.

* numberOfFailedCheckpoints. That is an explicit failure signal, which is good. 
We can set up alert like `alert if numberOfFailedCheckpoints > 0 in X out Y 
minutes`. We have some high-parallelism large-state jobs. Their normal 
checkpoint duration is <1-2 minutes. However, when recovering from an outage 
with large backlog, sometimes subtasks from one or a few containers experienced 
super high back pressure. It took checkpoint barrier sometimes more than an 
hour to travel through the DAG to those heavy back pressured subtasks. Causes 
of the back pressure are likely due to multi-tenancy environment and 
performance variation among containers. Instead of letting checkpoint to time 
out in this case, we decided to increase checkpoint timeout value to crazy long 
value (like 2 hours). In theory, one could argue that we can set checkpoint 
timeout to infinity. It is always better to have a long but completed 
checkpoint than a timed out checkpoint, as timed out checkpoint basically give 
up its positions in the queue and new checkpoint just reset the positions back 
to the end of the queue . Note that we are using at least checkpoint semantics. 
So there is no barrier alignment concern. FLIP-76 (unaligned checkpoints) can 
help checkpoint dealing with back pressure better. It is not ready now and also 
has its limitations. We think `elapsedSecondsSinceLastCompletedCheckpoint` is 
very intuitive to set up alert against.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to