[jira] [Commented] (FLINK-21510) ExecutionGraph metrics collide on restart

Chesnay Schepler (Jira) Fri, 10 Sep 2021 02:37:06 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-21510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413074#comment-17413074
 ]


Chesnay Schepler commented on FLINK-21510:
------------------------------------------

A fair amount because we have to completely decouple these metrics from the 
ExecutionGraph (and thus, checkpoint coordinator) instances. Otherwise we lose 
these metrics temporarily when the AdaptiveScheduler restarts a job (i.e., they 
are not exposed for some time and we lose the current state).

Essentially we will need to do something similar to what we do for task IO 
metrics: create a set of metrics that should be updated, and pass them to the 
component that needs them.
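
As a rough illustration of that idea (a sketch only, not the actual Flink 
code): the metrics live in a holder that is created once per job, and each 
newly created coordinator only receives it and updates it. All class, field 
and method names below are hypothetical stand-ins and do not refer to real 
Flink APIs.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical holder, created once per job and kept across ExecutionGraph re-creations. */
final class CheckpointMetricsSketch {
    final AtomicLong completedCheckpoints = new AtomicLong();
    final AtomicLong lastCheckpointSizeBytes = new AtomicLong();
}

/** Hypothetical stand-in for a per-attempt component such as a checkpoint coordinator. */
final class CoordinatorSketch {
    private final CheckpointMetricsSketch metrics;

    CoordinatorSketch(CheckpointMetricsSketch metrics) {
        // The component does not register anything itself; it only updates what it was given.
        this.metrics = metrics;
    }

    void onCheckpointCompleted(long sizeBytes) {
        metrics.completedCheckpoints.incrementAndGet();
        metrics.lastCheckpointSizeBytes.set(sizeBytes);
    }
}
```

Since the holder would be registered on the metric group only once, replacing 
the coordinator neither re-registers the metrics nor resets their values.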

For the checkpoint coordinator this should work reasonably well (but we have to 
double-check the semantics of all metrics).

The job status metrics currently poll information from the ExecutionGraph. 
Ideally they would ask the Scheduler instead, so the AdaptiveScheduler 
needs its own timestamp data structure. Depending on how FLINK-21513 turns out, 
the scheduler may also need to accumulate the durations of previous attempts 
(so you can, for example, get totalUptime/totalRunTime).
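
A minimal sketch of what such a scheduler-owned structure could look like, 
assuming the scheduler is notified whenever the job enters or leaves the 
RUNNING state. The class and method names are hypothetical, and the exact 
semantics would still depend on FLINK-21513.

```java
/** Hypothetical scheduler-owned tracker that outlives individual ExecutionGraph instances. */
final class UptimeTrackerSketch {
    private long accumulatedRunningMillis; // sum over finished running periods
    private long currentRunStart = -1L;    // -1 while the job is not running

    /** Called when the job enters RUNNING. */
    void onRunning(long nowMillis) {
        if (currentRunStart < 0) {
            currentRunStart = nowMillis;
        }
    }

    /** Called when the job leaves RUNNING (restart, failure, ...). */
    void onNotRunning(long nowMillis) {
        if (currentRunStart >= 0) {
            accumulatedRunningMillis += nowMillis - currentRunStart;
            currentRunStart = -1L;
        }
    }

    /** Uptime of the current attempt only; 0 while not running. */
    long currentUptime(long nowMillis) {
        return currentRunStart < 0 ? 0L : nowMillis - currentRunStart;
    }

    /** Uptime accumulated over all attempts, including the current one. */
    long totalUptime(long nowMillis) {
        return accumulatedRunningMillis + currentUptime(nowMillis);
    }
}
```

The metrics would then be gauges that read from this tracker rather than from 
the ExecutionGraph, so they stay exposed while the graph is being re-created.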

> ExecutionGraph metrics collide on restart
> -----------------------------------------
>
>                 Key: FLINK-21510
>                 URL: https://issues.apache.org/jira/browse/FLINK-21510
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Chesnay Schepler
>            Priority: Minor
>              Labels: auto-deprioritized-major, auto-unassigned, reactive
>
> The ExecutionGraphBuilder registers several metrics directly on the 
> JobManagerJobMetricGroup, which are never cleaned up.
> These include upTime/DownTime/restartingTime as well as various checkpointing 
> metrics (see the CheckpointStatsTracker for details; examples are number of 
> checkpoints, checkpoint sizes etc).
> When the AdaptiveScheduler re-creates the EG these will collide with metrics 
> of prior attempts.
> Essentially we either need to create a separate metric group that we pass to 
> the EG or refactor the metrics to be based on some mutable EG reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
