[ https://issues.apache.org/jira/browse/FLINK-12373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flink Jira Bot updated FLINK-12373:
-----------------------------------
Labels: stale-minor (was: )
> Improve checkpointing metrics
> -----------------------------
>
> Key: FLINK-12373
> URL: https://issues.apache.org/jira/browse/FLINK-12373
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Checkpointing
> Reporter: Gyula Fora
> Priority: Minor
> Labels: stale-minor
>
> The CheckpointMetrics class currently exposes 4 core metrics for each
> operator: bytesBuffered, alignment time, sync duration and async
> duration.
> I think it would be a great improvement to break the sync duration down
> into its components, as it contains information that is critical for
> improving the SLA of large jobs.
> I suggest we break up the sync duration into 4 subcomponents:
> 1. prepareSnapshotPreBarrier
> 2. Snapshot timers
> 3. Snapshot operator states
> 4. Sync keyed state checkpoint
> Maybe the operator state part could be further broken up into keyed and
> non-keyed parts, I don't know.
> I think knowing these metrics is crucial for users to minimise the latency
> caused by checkpointing.
> Whether we want to show all this info on the web ui is another discussion :)
>
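To make the proposed breakdown concrete, here is a minimal sketch of how the four sync sub-phases could be timed separately. It is not Flink code: the class SyncPhaseBreakdown, the Phase enum and the injected clock are hypothetical names chosen for illustration, and how such values would be wired into CheckpointMetrics is left open.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical sketch: per-phase timing of the synchronous part of a checkpoint. */
public class SyncPhaseBreakdown {

    /** The four sub-phases proposed in the issue (enum names are illustrative). */
    public enum Phase {
        PREPARE_SNAPSHOT_PRE_BARRIER,
        SNAPSHOT_TIMERS,
        SNAPSHOT_OPERATOR_STATE,
        SYNC_KEYED_STATE
    }

    // Clock is injected so the class can be tested deterministically;
    // production code would typically pass System::nanoTime.
    private final LongSupplier nanoClock;
    private final Map<Phase, Long> nanosByPhase = new EnumMap<>(Phase.class);

    public SyncPhaseBreakdown(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    /** Runs one phase and adds its elapsed time to that phase's running total. */
    public void time(Phase phase, Runnable work) {
        long start = nanoClock.getAsLong();
        try {
            work.run();
        } finally {
            nanosByPhase.merge(phase, nanoClock.getAsLong() - start, Long::sum);
        }
    }

    /** Accumulated nanoseconds for one phase, or 0 if it never ran. */
    public long nanos(Phase phase) {
        return nanosByPhase.getOrDefault(phase, 0L);
    }

    /** Sum over all phases, comparable to a single aggregated sync duration. */
    public long totalSyncNanos() {
        long total = 0L;
        for (long n : nanosByPhase.values()) {
            total += n;
        }
        return total;
    }
}
```

A caller would wrap each step of the synchronous snapshot in time(...), for example time(Phase.SNAPSHOT_TIMERS, ...), and report the four values alongside the existing aggregate. Keeping the total derivable from the parts means the current sync duration metric would not have to change.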