[ 
https://issues.apache.org/jira/browse/FLINK-33695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-33695:
-----------------------------------
    Description: 
https://cwiki.apache.org/confluence/x/TguZE

*Motivation*
Flink currently offers only limited observability into its checkpoint and 
recovery processes.

For checkpointing, the Flink WebUI provides a very detailed overview, which 
works well in many use cases. However, it is problematic when operating 
multiple Flink clusters, or when the cluster/JM dies. Additionally, there are 
a couple of metrics (like lastCheckpointDuration or lastCheckpointSize), but 
those metrics have a couple of issues:

They are reported and refreshed periodically, depending on the MetricReporter 
settings, which does not take the checkpointing frequency into account.

If checkpointing interval > metric reporting interval, the same values are 
reported multiple times.

If checkpointing interval < metric reporting interval, metrics for some of the 
checkpoints are dropped at random.
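
The two cases above can be illustrated with a small, self-contained simulation. 
This is only a sketch under simplifying assumptions: checkpoints complete at 
exact multiples of the checkpointing interval, the reporter reads a "last 
completed checkpoint" gauge at exact multiples of the reporting interval, and 
the class and method names (GaugeSamplingSketch, sampledCheckpointIds) are 
made up for illustration and are not part of Flink.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Flink code: models a periodically sampled gauge.
class GaugeSamplingSketch {

    /**
     * Returns the checkpoint id a periodic reporter would observe at each
     * reporting tick. Checkpoint k (k = 1, 2, ...) is assumed to complete at
     * time k * checkpointIntervalMs. An id of 0 means that no checkpoint has
     * completed yet.
     */
    public static List<Long> sampledCheckpointIds(
            long checkpointIntervalMs, long reportIntervalMs, long horizonMs) {
        List<Long> seen = new ArrayList<>();
        for (long t = reportIntervalMs; t <= horizonMs; t += reportIntervalMs) {
            // Id of the latest checkpoint completed at or before time t.
            seen.add(t / checkpointIntervalMs);
        }
        return seen;
    }

    public static void main(String[] args) {
        // Checkpoint every 30 s, report every 10 s: values are repeated.
        System.out.println(sampledCheckpointIds(30_000, 10_000, 60_000));
        // prints [0, 0, 1, 1, 1, 2]

        // Checkpoint every 10 s, report every 30 s: checkpoints 1, 2, 4 and 5
        // are never seen by the reporter.
        System.out.println(sampledCheckpointIds(10_000, 30_000, 60_000));
        // prints [3, 6]
    }
}
```

In a real job the two schedules are not aligned, so which checkpoints get 
skipped is effectively random rather than the fixed pattern shown here.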

For recovery, even the most basic metrics and Flink WebUI support are 
missing. Also, since recovery is even less frequent than checkpointing, 
recovery metrics would suffer even more from unnecessarily reporting the same 
values.

In this FLIP I’m proposing to add support for reporting traces/spans and to 
use this mechanism to report checkpointing and recovery traces. I hope that in 
the future traces will also prove useful in other areas of Flink, such as job 
submission and job state changes. Moreover, as the API to report traces will 
be added to the MetricGroup, users will also be able to access this API.
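
To make the contrast with periodically sampled metrics concrete, here is a 
minimal sketch of event-driven span reporting. It is not the API proposed in 
the FLIP: SimpleSpan, SpanSink, InMemorySink and reportCheckpoints are 
hypothetical names invented for this example, and the attribute key 
"checkpointId" is likewise an assumption. The point it shows is that one span 
is emitted per completed checkpoint, at completion time, independent of any 
reporting interval.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the FLIP-384 API.
class SpanReportingSketch {

    /** Simplified span: a named time interval with free-form attributes. */
    public static final class SimpleSpan {
        public final String name;
        public final long startTsMillis;
        public final long endTsMillis;
        public final Map<String, Object> attributes = new LinkedHashMap<>();

        public SimpleSpan(String name, long startTsMillis, long endTsMillis) {
            this.name = name;
            this.startTsMillis = startTsMillis;
            this.endTsMillis = endTsMillis;
        }

        public long durationMillis() {
            return endTsMillis - startTsMillis;
        }
    }

    /** Stand-in for whatever backend would receive the spans. */
    public interface SpanSink {
        void report(SimpleSpan span);
    }

    /** Keeps reported spans in memory so they can be inspected. */
    public static final class InMemorySink implements SpanSink {
        public final List<SimpleSpan> spans = new ArrayList<>();

        @Override
        public void report(SimpleSpan span) {
            spans.add(span);
        }
    }

    /** Emits exactly one span per checkpoint, when that checkpoint completes. */
    public static List<SimpleSpan> reportCheckpoints(
            long checkpointIntervalMs, long checkpointDurationMs, long horizonMs) {
        InMemorySink sink = new InMemorySink();
        long id = 1;
        for (long end = checkpointIntervalMs; end <= horizonMs; end += checkpointIntervalMs) {
            SimpleSpan span = new SimpleSpan("Checkpoint", end - checkpointDurationMs, end);
            span.attributes.put("checkpointId", id++);
            sink.report(span); // reported once, at completion
        }
        return sink.spans;
    }

    public static void main(String[] args) {
        List<SimpleSpan> spans = reportCheckpoints(10_000, 1_500, 60_000);
        System.out.println(spans.size() + " spans"); // prints "6 spans"
    }
}
```

Because reporting is triggered by the event itself, no checkpoint is reported 
twice and none is skipped, and rare events such as recovery produce a single 
span instead of a repeatedly re-reported value.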

> FLIP-384: Introduce TraceReporter and use it to create checkpointing and 
> recovery traces
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-33695
>                 URL: https://issues.apache.org/jira/browse/FLINK-33695
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Checkpointing, Runtime / Metrics
>            Reporter: Piotr Nowojski
>            Assignee: Piotr Nowojski
>            Priority: Major
>             Fix For: 1.19.0
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
