[
https://issues.apache.org/jira/browse/FLINK-33695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Piotr Nowojski closed FLINK-33695.
----------------------------------
Resolution: Fixed
> FLIP-384: Introduce TraceReporter and use it to create checkpointing and
> recovery traces
> ----------------------------------------------------------------------------------------
>
> Key: FLINK-33695
> URL: https://issues.apache.org/jira/browse/FLINK-33695
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Checkpointing, Runtime / Metrics
> Reporter: Piotr Nowojski
> Assignee: Piotr Nowojski
> Priority: Major
> Fix For: 1.19.0
>
>
> https://cwiki.apache.org/confluence/x/TguZE
> *Motivation*
> Currently Flink offers limited observability into the checkpointing and
> recovery processes.
> For checkpointing, Flink has a very detailed overview in the Flink WebUI,
> which works well in many use cases. However, it is problematic if one is
> operating multiple Flink clusters, or if the cluster/JM dies. Additionally,
> there are a couple of metrics (like lastCheckpointDuration or
> lastCheckpointSize), but those metrics have a couple of issues:
> * They are reported and refreshed periodically, depending on the
> MetricReporter settings, which do not take the checkpointing frequency
> into account.
> ** If the checkpointing interval > the metric reporting interval, the same
> values would be reported multiple times.
> ** If the checkpointing interval < the metric reporting interval, metrics
> for some of the checkpoints would be randomly dropped.
> For recovery, we are missing even the most basic metrics and Flink WebUI
> support. Also, given that recovery is even less frequent than
> checkpointing, adding recovery metrics would cause even bigger problems
> with unnecessarily reporting the same values.
> In this FLIP I’m proposing to add support for reporting traces/spans
> (example: Traces) and to use this mechanism to report checkpointing and
> recovery traces. I hope that in the future traces will also prove useful in
> other areas of Flink, like job submission, job state changes, etc. Moreover,
> as the API to report traces will be added to the MetricGroup, users will
> also be able to access this API.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)