[
https://issues.apache.org/jira/browse/FLINK-33695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Piotr Nowojski updated FLINK-33695:
-----------------------------------
Description:
https://cwiki.apache.org/confluence/x/TguZE
*Motivation*
Currently Flink has limited observability of the checkpoint and recovery
processes.
For checkpointing, Flink has a very detailed overview in the Flink WebUI, which
works great in many use cases. However, it’s problematic if one is operating
multiple Flink clusters, or if the cluster/JM dies. Additionally, there are a
couple of metrics (like lastCheckpointDuration or lastCheckpointSize), but those
metrics have a couple of issues:
* They are reported and refreshed periodically, depending on the MetricReporter
settings, which doesn’t take into account checkpointing frequency.
** If checkpointing interval > metric reporting interval, we would be reporting
the same values multiple times.
** If checkpointing interval < metric reporting interval, we would be randomly
dropping metrics for some of the checkpoints.
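To make the two cases above concrete, here is a toy simulation. It is not
Flink code: the class and method names are hypothetical, and it assumes that
checkpoint k completes exactly at k * checkpointIntervalMs and that the gauge
simply holds the id of the last completed checkpoint.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class GaugeSamplingSketch {

    // Checkpoint ids a periodic reporter would observe, one entry per report tick.
    static List<Long> sample(long checkpointIntervalMs, long reportIntervalMs, long durationMs) {
        List<Long> observed = new ArrayList<>();
        for (long t = reportIntervalMs; t <= durationMs; t += reportIntervalMs) {
            // Assumption: checkpoint k completes at exactly k * checkpointIntervalMs.
            long lastCompleted = t / checkpointIntervalMs;
            if (lastCompleted > 0) { // nothing to report before the first checkpoint
                observed.add(lastCompleted);
            }
        }
        return observed;
    }

    // Number of reports that only repeat an already reported checkpoint.
    static long repeatedReports(List<Long> observed) {
        return observed.size() - new HashSet<>(observed).size();
    }

    // Number of completed checkpoints that no report ever captured.
    static long missedCheckpoints(List<Long> observed, long checkpointIntervalMs, long durationMs) {
        long completed = durationMs / checkpointIntervalMs;
        return completed - new HashSet<>(observed).size();
    }

    public static void main(String[] args) {
        long durationMs = 180_000L;

        // Checkpointing interval (60s) > metric reporting interval (10s).
        List<Long> slowCheckpoints = sample(60_000L, 10_000L, durationMs);
        System.out.println("repeated=" + repeatedReports(slowCheckpoints)
                + " missed=" + missedCheckpoints(slowCheckpoints, 60_000L, durationMs));

        // Checkpointing interval (10s) < metric reporting interval (60s).
        List<Long> fastCheckpoints = sample(10_000L, 60_000L, durationMs);
        System.out.println("repeated=" + repeatedReports(fastCheckpoints)
                + " missed=" + missedCheckpoints(fastCheckpoints, 10_000L, durationMs));
    }
}
```

In this model the first configuration only produces repeated reports, while the
second one loses most checkpoints. An event-based report emitted once per
checkpoint would avoid both.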
For recovery, we are missing even the most basic metrics and Flink WebUI
support. Also, given that recovery is even less frequent than checkpoints,
adding recovery metrics would have even bigger problems with unnecessarily
reporting the same values.
In this FLIP I’m proposing to add support for reporting traces/spans (example:
Traces) and to use this mechanism to report checkpointing and recovery traces.
I hope that in the future traces will also prove useful in other areas of Flink
like job submission, job state changes, ... Moreover, as the API to report
traces will be added to the MetricGroup, users will also be able to access this
API.
> FLIP-384: Introduce TraceReporter and use it to create checkpointing and
> recovery traces
> ----------------------------------------------------------------------------------------
>
> Key: FLINK-33695
> URL: https://issues.apache.org/jira/browse/FLINK-33695
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Checkpointing, Runtime / Metrics
> Reporter: Piotr Nowojski
> Assignee: Piotr Nowojski
> Priority: Major
> Fix For: 1.19.0