restartingTime metrics

Chesnay Schepler (Jira) Wed, 09 Feb 2022 04:17:05 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Chesnay Schepler updated FLINK-21513:
-------------------------------------
    Fix Version/s: 1.16.0

> Rethink up-/down-/restartingTime metrics
> ----------------------------------------
>
>                 Key: FLINK-21513
>                 URL: https://issues.apache.org/jira/browse/FLINK-21513
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / Metrics
>            Reporter: Chesnay Schepler
>            Priority: Minor
>              Labels: auto-deprioritized-major, reactive
>             Fix For: 1.16.0
>
>
> While thinking about FLINK-21510 I stumbled upon some issues in the the 
> semantics of these metrics, both from a user perspective and from our own, 
> and I think we need to clarify some things.
> h4. upTime
> This metric describes the time since the job transitioned RUNNING state.
> It is meant as a measure for how stably a deployment is.
> In the default scheduler this transitions happens before we do any actual 
> scheduling work, and as a result this also includes the time it takes for the 
> JM to request slots and deploy tasks. In practive this means we start the 
> timer once the job has been submitted and the JobMaster/Scheduler/EG have 
> been initialized.
> For the adaptive scheduler this now puts us a bit into an odd situation 
> because it first acquires slots before actually transitioning the EG into a 
> RUNNING state, so as is we'd end up measuring 2 slightly different things.
> The question now is whether this is a problem.
> While we could certainly stick with the definition of "time since EG switched 
> to RUNNING", it raises the question what the semantics of this metric are 
> should a scheduler use a different data-structure than the EG.
> In other words, what I'm looking for is a definition that is independent from 
> existing data-structures; a crude example could be "The time since the job is 
> in a state where the deployment of a task is possible.".
> An alternative for the adaptive scheduler would be to measure the time since 
> we transitioned to WaitingForResources, with which we would also include the 
> slot acquisition, but it would be inconsistent with the logs and UI (because 
> they only display an INITIALIZING job).
> h4. restartingTime
> This metric describes the time since the job transitioned into a RESTARTING 
> state.
> It is meant as a measure for how long the recovery in case of a job failure 
> takes.
> In the default scheduler this in practice is the time between a failure 
> arriving at the JM and the cancellation of tasks being completed / restart 
> backoff (whichever is higher).
> This is consistent with the semantics of the upTime metric, because upTime 
> also includes the time required for acquiring slots and deploying tasks.
> For the adaptive scheduler we can follow similar semantics, by measuring the 
> time we spend in the {{Restarting}} state.
> However, if we stick to the definition of upTime as time spent in RUNNING, 
> then we will end up with a gap for the time spent in WaitingForResources.
> h4. downTime
> This metric describes the time between the job transitioning from FAILING to 
> RUNNING.
> It is meant as a measure for how long the recovery in case of a job failure 
> takes.
> You may be wondering what the difference between {{downTime}} and 
> {{restartingTime}} is meant to be. Unfortunately I do not have the answer to 
> that.
> Presumably, at the time they were added, they were covering different parts 
> of the recovery process, but since we never documented these steps explicitly 
> the exact semantics are no longer clear and there are no specs that a 
> scheduler can follow.
> Furthermore, this metric is currently broken because a FAILING job _never_ 
> transitions into RUNNING anymore.
> The default scheduler transitions from RUNNING -> RESTARTING -> RUNNING, 
> whereas the adaptive scheduler cancels the job and creates a new EG.
> As it is we could probably merge downTime and restartingTime because they 
> seem to cover the exact same thing.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Updated] (FLINK-21513) Rethink up-/down-/restartingTime metrics

Reply via email to