[ 
https://issues.apache.org/jira/browse/FLINK-12373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flink Jira Bot updated FLINK-12373:
-----------------------------------
    Labels: stale-minor  (was: )

> Improve checkpointing metrics
> -----------------------------
>
>                 Key: FLINK-12373
>                 URL: https://issues.apache.org/jira/browse/FLINK-12373
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Checkpointing
>            Reporter: Gyula Fora
>            Priority: Minor
>              Labels: stale-minor
>
> The checkpoint metrics encapsulated in the CheckpointMetrics class currently 
> expose 4 core metrics for each operator: bytesBuffered, alignment time, sync 
> duration and async duration.
> I think it would be a great improvement to break up the tracking of the sync 
> duration into its different components, as it contains information that is 
> critical for improving the SLA of large jobs.
> I suggest we break up the sync duration into 4 subcomponents:
>  1. prepareSnapshotPreBarrier
>  2. Snapshot timers
>  3. Snapshot operator states
>  4. Sync keyed state checkpoint
> Maybe the operator state part could be further broken up into keyed/non-keyed 
> parts, I don't know.
> I think knowing these metrics is crucial for users to minimise the latency 
> caused by checkpointing.
> Whether we want to show all this info on the web ui is another discussion :)
>  
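
The proposed breakdown of the sync duration could be sketched roughly as
follows. This is only an illustration: SyncPhaseTimings, its Phase enum and
all method names are hypothetical and are not part of Flink's
CheckpointMetrics class. The sketch assumes each of the four sub-phases can
be wrapped in a timed call, and that their sum corresponds to the single
sync duration reported today.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical helper, not a Flink API: per-phase timing of the sync part of a checkpoint. */
final class SyncPhaseTimings {

    /** The four sub-phases suggested in the issue description. */
    enum Phase {
        PREPARE_SNAPSHOT_PRE_BARRIER,
        SNAPSHOT_TIMERS,
        SNAPSHOT_OPERATOR_STATE,
        SYNC_KEYED_STATE
    }

    private final Map<Phase, Long> nanosByPhase = new EnumMap<>(Phase.class);
    private final LongSupplier clock;

    /** Uses System.nanoTime() as the clock. */
    SyncPhaseTimings() {
        this(System::nanoTime);
    }

    /** Accepts a custom nanosecond clock, mainly so the class can be tested. */
    SyncPhaseTimings(LongSupplier clock) {
        this.clock = clock;
    }

    /** Runs the work of one phase and adds the elapsed time to that phase's total. */
    void time(Phase phase, Runnable work) {
        long start = clock.getAsLong();
        try {
            work.run();
        } finally {
            nanosByPhase.merge(phase, clock.getAsLong() - start, Long::sum);
        }
    }

    /** Elapsed nanoseconds recorded for one phase, or 0 if it never ran. */
    long getNanos(Phase phase) {
        return nanosByPhase.getOrDefault(phase, 0L);
    }

    /** Sum over all phases, i.e. the value a single "sync duration" metric would show. */
    long totalSyncNanos() {
        long total = 0L;
        for (long nanos : nanosByPhase.values()) {
            total += nanos;
        }
        return total;
    }
}
```

A caller would wrap each step, for example
timings.time(Phase.SNAPSHOT_TIMERS, () -> ...), and then report
getNanos(...) per phase alongside totalSyncNanos(). Whether the real
snapshot code path can be split this cleanly is an open question.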



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
