[
https://issues.apache.org/jira/browse/FLINK-21510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413074#comment-17413074
]
Chesnay Schepler commented on FLINK-21510:
------------------------------------------
A fair amount because we have to completely decouple these metrics from the
ExecutionGraph (and thus, checkpoint coordinator) instances. Otherwise we lose
these metrics temporarily when the AdaptiveScheduler restarts a job (i.e., they
are not exposed for some time and we lose the current state).
Essentially we will need to do something similar to what we do for task IO
metrics: create a set of metrics that should be updated, and pass them to the
component that needs them.
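
To make the idea concrete, here is a rough sketch of what "create the metrics
once and hand them to the component" could look like. It is plain Java and
deliberately does not use Flink classes; CheckpointMetrics, MetricRegistry,
Coordinator and the metric name "completedCheckpoints" are all hypothetical
stand-ins. The registry throws on duplicate names only to make a collision
visible in the sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical sketch; none of these types exist in Flink. */
final class DecoupledMetricsSketch {

    /** Long-lived holder: created once per job, survives restarts. */
    static final class CheckpointMetrics {
        private final AtomicLong completedCheckpoints = new AtomicLong();
        private final AtomicLong lastCheckpointSize = new AtomicLong(-1L);

        void reportCompleted(long sizeInBytes) {
            completedCheckpoints.incrementAndGet();
            lastCheckpointSize.set(sizeInBytes);
        }

        long getCompletedCheckpoints() { return completedCheckpoints.get(); }

        long getLastCheckpointSize() { return lastCheckpointSize.get(); }
    }

    /** Stand-in for a metric group (assumption: duplicates are rejected loudly). */
    static final class MetricRegistry {
        private final Map<String, LongSupplier> gauges = new HashMap<>();

        void gauge(String name, LongSupplier gauge) {
            if (gauges.putIfAbsent(name, gauge) != null) {
                throw new IllegalStateException("metric collision: " + name);
            }
        }

        long read(String name) { return gauges.get(name).getAsLong(); }
    }

    /** Stand-in for a per-attempt component such as a checkpoint coordinator. */
    static final class Coordinator {
        private final CheckpointMetrics metrics;

        Coordinator(CheckpointMetrics metrics) { this.metrics = metrics; }

        // The component only updates the metrics it was handed; it registers nothing.
        void completeCheckpoint(long sizeInBytes) { metrics.reportCompleted(sizeInBytes); }
    }

    /** Simulates one restart and returns the count seen through the registry. */
    static long demo() {
        MetricRegistry registry = new MetricRegistry();
        CheckpointMetrics metrics = new CheckpointMetrics();
        // Registered exactly once, independent of any attempt.
        registry.gauge("completedCheckpoints", metrics::getCompletedCheckpoints);

        new Coordinator(metrics).completeCheckpoint(100L); // first attempt
        new Coordinator(metrics).completeCheckpoint(250L); // re-created after a restart
        return registry.read("completedCheckpoints");
    }
}
```

Because the gauge is registered once against the long-lived holder, re-creating
the coordinator neither re-registers anything nor resets the count.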
For the checkpoint coordinator this should work reasonably well (but we have to
double-check the semantics of all metrics).
The job status metrics currently poll information from the ExecutionGraph.
Ideally they would ask the Scheduler instead, so the AdaptiveScheduler
needs its own timestamp data structure. Depending on how FLINK-21513 turns out,
the scheduler may also need to accumulate the durations of previous attempts
(so you can, for example, compute totalUptime/totalRunTime).
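
A minimal sketch of such a scheduler-owned structure, assuming that "total run
time" means the summed running time of all attempts so far. AttemptTimeTracker
and its methods are hypothetical names, not existing Flink API, and the exact
semantics would depend on FLINK-21513. Timestamps are passed in explicitly so
the behaviour is easy to test.

```java
/** Hypothetical sketch; not an existing Flink class. */
final class AttemptTimeTracker {

    private static final long NOT_RUNNING = -1L;

    // Summed running time of all attempts that have already stopped.
    private long accumulatedRunTimeMs = 0L;
    // Start timestamp of the current attempt, or NOT_RUNNING.
    private long currentAttemptStartMs = NOT_RUNNING;

    void attemptStarted(long nowMs) {
        if (currentAttemptStartMs != NOT_RUNNING) {
            throw new IllegalStateException("an attempt is already running");
        }
        currentAttemptStartMs = nowMs;
    }

    void attemptStopped(long nowMs) {
        if (currentAttemptStartMs == NOT_RUNNING) {
            throw new IllegalStateException("no attempt is running");
        }
        accumulatedRunTimeMs += nowMs - currentAttemptStartMs;
        currentAttemptStartMs = NOT_RUNNING;
    }

    /** Up time of the current attempt only; 0 while no attempt is running. */
    long currentUpTime(long nowMs) {
        return currentAttemptStartMs == NOT_RUNNING ? 0L : nowMs - currentAttemptStartMs;
    }

    /** Running time across all attempts, including the current one. */
    long totalRunTime(long nowMs) {
        return accumulatedRunTimeMs + currentUpTime(nowMs);
    }
}
```

Since the tracker lives in the scheduler rather than in the ExecutionGraph, the
job status gauges can keep reading from it while the graph is being re-created.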
> ExecutionGraph metrics collide on restart
> -----------------------------------------
>
> Key: FLINK-21510
> URL: https://issues.apache.org/jira/browse/FLINK-21510
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Reporter: Chesnay Schepler
> Priority: Minor
> Labels: auto-deprioritized-major, auto-unassigned, reactive
>
> The ExecutionGraphBuilder registers several metrics directly on the
> JobManagerJobMetricGroup, which are never cleaned up.
> These include upTime/DownTime/restartingTime as well as various checkpointing
> metrics (see the CheckpointStatsTracker for details; examples are number of
> checkpoints, checkpoint sizes etc).
> When the AdaptiveScheduler re-creates the EG these will collide with metrics
> of prior attempts.
> Essentially we either need to create a separate metric group that we pass to
> the EG or refactor the metrics to be based on some mutable EG reference.