[
https://issues.apache.org/jira/browse/FLINK-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chesnay Schepler updated FLINK-21513:
-------------------------------------
Fix Version/s: 1.16.0
> Rethink up-/down-/restartingTime metrics
> ----------------------------------------
>
> Key: FLINK-21513
> URL: https://issues.apache.org/jira/browse/FLINK-21513
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Metrics
> Reporter: Chesnay Schepler
> Priority: Minor
> Labels: auto-deprioritized-major, reactive
> Fix For: 1.16.0
>
>
> While thinking about FLINK-21510 I stumbled upon some issues in the the
> semantics of these metrics, both from a user perspective and from our own,
> and I think we need to clarify some things.
> h4. upTime
> This metric describes the time since the job transitioned RUNNING state.
> It is meant as a measure for how stably a deployment is.
> In the default scheduler this transitions happens before we do any actual
> scheduling work, and as a result this also includes the time it takes for the
> JM to request slots and deploy tasks. In practive this means we start the
> timer once the job has been submitted and the JobMaster/Scheduler/EG have
> been initialized.
> For the adaptive scheduler this now puts us a bit into an odd situation
> because it first acquires slots before actually transitioning the EG into a
> RUNNING state, so as is we'd end up measuring 2 slightly different things.
> The question now is whether this is a problem.
> While we could certainly stick with the definition of "time since EG switched
> to RUNNING", it raises the question what the semantics of this metric are
> should a scheduler use a different data-structure than the EG.
> In other words, what I'm looking for is a definition that is independent from
> existing data-structures; a crude example could be "The time since the job is
> in a state where the deployment of a task is possible.".
> An alternative for the adaptive scheduler would be to measure the time since
> we transitioned to WaitingForResources, with which we would also include the
> slot acquisition, but it would be inconsistent with the logs and UI (because
> they only display an INITIALIZING job).
> h4. restartingTime
> This metric describes the time since the job transitioned into a RESTARTING
> state.
> It is meant as a measure for how long the recovery in case of a job failure
> takes.
> In the default scheduler this in practice is the time between a failure
> arriving at the JM and the cancellation of tasks being completed / restart
> backoff (whichever is higher).
> This is consistent with the semantics of the upTime metric, because upTime
> also includes the time required for acquiring slots and deploying tasks.
> For the adaptive scheduler we can follow similar semantics, by measuring the
> time we spend in the {{Restarting}} state.
> However, if we stick to the definition of upTime as time spent in RUNNING,
> then we will end up with a gap for the time spent in WaitingForResources.
> h4. downTime
> This metric describes the time between the job transitioning from FAILING to
> RUNNING.
> It is meant as a measure for how long the recovery in case of a job failure
> takes.
> You may be wondering what the difference between {{downTime}} and
> {{restartingTime}} is meant to be. Unfortunately I do not have the answer to
> that.
> Presumably, at the time they were added, they were covering different parts
> of the recovery process, but since we never documented these steps explicitly
> the exact semantics are no longer clear and there are no specs that a
> scheduler can follow.
> Furthermore, this metric is currently broken because a FAILING job _never_
> transitions into RUNNING anymore.
> The default scheduler transitions from RUNNING -> RESTARTING -> RUNNING,
> whereas the adaptive scheduler cancels the job and creates a new EG.
> As it is we could probably merge downTime and restartingTime because they
> seem to cover the exact same thing.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)