[
https://issues.apache.org/jira/browse/FLINK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-17531:
-----------------------------------
Labels: auto-deprioritized-major auto-deprioritized-minor (was:
auto-deprioritized-major stale-minor)
Priority: Not a Priority (was: Minor)
This issue was labeled "stale-minor" 7 days ago and has not received any
updates, so it is being deprioritized. If this ticket is actually Minor, please
raise the priority and ask a committer to assign you the issue or revive the
public discussion.
> Add a new checkpoint Gauge metric: elapsedTimeSinceLastCompletedCheckpoint
> --------------------------------------------------------------------------
>
> Key: FLINK-17531
> URL: https://issues.apache.org/jira/browse/FLINK-17531
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Checkpointing
> Affects Versions: 1.10.0
> Reporter: Steven Zhen Wu
> Priority: Not a Priority
> Labels: auto-deprioritized-major, auto-deprioritized-minor
>
> I'd like to discuss the value of a new checkpoint Gauge metric:
> `elapsedTimeSinceLastCompletedCheckpoint`. The main motivation is alerting. I
> know the reasons below are somewhat specific to our setup, hence I want to
> explore the interest of the community.
>
> *What do we want to achieve?*
> We want to alert if no checkpoint has completed successfully for a specific
> period. With this new metric, we can set up a simple alerting rule like
> `alert if elapsedTimeSinceLastCompletedCheckpoint > N minutes`. It follows
> the well-known `time since last success` alerting pattern, and we found the
> `elapsedTimeSinceLastCompletedCheckpoint` metric very easy and intuitive to
> alert against.
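>
> As a rough illustration, here is a minimal sketch of how such a gauge could be
> wired up through Flink's `Gauge`/`MetricGroup` API. The class name and the
> `onCheckpointCompleted` hook are hypothetical, and where exactly the gauge
> would be registered (e.g. alongside the existing checkpoint metrics in the
> checkpoint stats tracker) is part of what needs to be discussed:
>
> ```java
> import org.apache.flink.metrics.Gauge;
> import org.apache.flink.metrics.MetricGroup;
>
> // Hypothetical sketch: expose the time since the last completed checkpoint as a Gauge.
> public class ElapsedTimeSinceLastCompletedCheckpointGauge implements Gauge<Long> {
>
>     // Updated whenever a checkpoint completes; -1 means no checkpoint has completed yet.
>     private volatile long lastCompletionTimestampMillis = -1L;
>
>     // Hypothetical hook, e.g. invoked when a checkpoint is reported as completed.
>     public void onCheckpointCompleted(long completionTimestampMillis) {
>         this.lastCompletionTimestampMillis = completionTimestampMillis;
>     }
>
>     @Override
>     public Long getValue() {
>         long last = lastCompletionTimestampMillis;
>         return last < 0 ? -1L : System.currentTimeMillis() - last;
>     }
>
>     // Registers the gauge under the proposed metric name on a given metric group.
>     public static ElapsedTimeSinceLastCompletedCheckpointGauge registerOn(MetricGroup group) {
>         return group.gauge("elapsedTimeSinceLastCompletedCheckpoint",
>                 new ElapsedTimeSinceLastCompletedCheckpointGauge());
>     }
> }
> ```
>
> With something like this in place, the alert rule maps directly onto a single
> reported gauge value rather than onto a derivative of a counter.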
>
> *What about existing checkpoint metrics?*
> `numberOfCompletedCheckpoints`. We can set up an alert like `alert if
> derivative(numberOfCompletedCheckpoints) == 0 for N minutes`. However, this is
> an anti-pattern for our alerting system, as it looks for the absence of a good
> signal rather than an explicit bad signal. Such an anti-pattern is more prone
> to false alarms when there is an occasional metric drop or a processing issue
> in the alerting system.
>
> `numberOfFailedCheckpoints`. This is an explicit failure signal, which is
> good. We can set up an alert like `alert if
> derivative(numberOfFailedCheckpoints) > 0 in X out of Y minutes`. We have some
> high-parallelism, large-state jobs whose normal checkpoint duration is under
> 1-2 minutes. However, when recovering from an outage with a large backlog,
> subtasks on one or a few containers sometimes experience very high back
> pressure, and it can take the checkpoint barrier more than an hour to travel
> through the DAG to those heavily back-pressured subtasks. The back pressure is
> likely caused by the multi-tenant environment and performance variation among
> containers. Instead of letting checkpoints time out in this case, we decided
> to increase the checkpoint timeout to a very long value (like 2 hours). With
> that, we essentially lose the explicit "bad" signal of a failed or timed-out
> checkpoint.
>
> In theory, one could argue that we can set the checkpoint timeout to infinity.
> It is always better to have a long but completed checkpoint than a timed-out
> one, as a timed-out checkpoint basically gives up its position in the queue
> and a new checkpoint just resets the position back to the end of the queue.
> Note that we are using at-least-once checkpoint semantics, so there is no
> barrier alignment concern. FLIP-76 (unaligned checkpoints) can help
> checkpoints deal with back pressure better, but it is not ready yet and also
> has its limitations. That is a separate discussion.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)