[
https://issues.apache.org/jira/browse/FLINK-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chesnay Schepler updated FLINK-21513:
-------------------------------------
Description:
While thinking about FLINK-21510 I stumbled upon some issues in the
semantics of these metrics, both from a user perspective and from our own, and
I think we need to clarify some things.
h4. upTime
This metric describes the time since the job transitioned into the RUNNING state.
It is meant as a measure of how stable a deployment is.
In the default scheduler this transition happens before we do any actual
scheduling work, and as a result it also includes the time it takes for the
JM to request slots and deploy tasks. In practice this means we start the timer
once the job has been submitted and the JobMaster/Scheduler/EG have been
initialized.
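To make the current behaviour concrete, here is a simplified, self-contained
sketch (not the actual Flink code; all class and method names below are made up
purely for illustration) of a gauge keyed on the RUNNING transition:
{code:java}
import java.time.Clock;

// Simplified sketch, NOT the real Flink implementation; the actual metric is
// derived from the job's status timestamps.
enum SketchJobStatus { INITIALIZING, RUNNING, RESTARTING, FAILING, CANCELED }

final class UpTimeSketch {
    private final Clock clock;
    private volatile long runningSinceMillis = -1; // -1 means "not RUNNING"

    UpTimeSketch(Clock clock) {
        this.clock = clock;
    }

    // Called on every job status transition.
    void onStatusChange(SketchJobStatus newStatus, long timestampMillis) {
        // In the default scheduler this fires right after submission, i.e.
        // before slots are requested and before any task is deployed.
        runningSinceMillis =
            (newStatus == SketchJobStatus.RUNNING) ? timestampMillis : -1;
    }

    // Value the upTime gauge would report.
    long getUpTimeMillis() {
        long since = runningSinceMillis;
        return since < 0 ? 0 : clock.millis() - since;
    }
}
{code}
The sketch only illustrates where the timer starts: the default scheduler flips
to RUNNING before slot acquisition, while the adaptive scheduler would only do
so afterwards.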
For the adaptive scheduler this now puts us in a bit of an odd situation,
because it first acquires slots before actually transitioning the EG into a
RUNNING state, so as-is we would end up measuring two slightly different things.
The question now is whether this is a problem.
While we could certainly stick with the definition of "time since the EG switched
to RUNNING", this raises the question of what the semantics of the metric would be
should a scheduler use a different data structure than the EG.
In other words, what I'm looking for is a definition that is independent of
existing data structures; a crude example could be "The time since the job has
been in a state where the deployment of a task is possible."
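Such a data-structure-independent definition could be captured in a small
contract that every scheduler implements; the interface and method below are
hypothetical and only meant to illustrate the idea:
{code:java}
import java.util.OptionalLong;

// Hypothetical contract, independent of the ExecutionGraph.
interface DeploymentReadiness {
    // Timestamp (in ms) since which the job has been in a state where the
    // deployment of a task is possible; empty if it is not in such a state.
    // upTime would then simply be "now - deployableSinceMillis()".
    OptionalLong deployableSinceMillis();
}
{code}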
An alternative for the adaptive scheduler would be to measure the time since we
transitioned to WaitingForResources, which would also include the slot
acquisition, but that would be inconsistent with the logs and UI (because they
only display an INITIALIZING job).
h4. restartingTime
This metric describes the time since the job transitioned into the RESTARTING
state.
It is meant as a measure of how long recovery takes in case of a job failure.
In the default scheduler this is, in practice, the time between a failure
arriving at the JM and either the cancellation of tasks completing or the
restart backoff elapsing (whichever takes longer).
This is consistent with the semantics of the upTime metric, because upTime also
includes the time required for acquiring slots and deploying tasks.
For the adaptive scheduler we can follow similar semantics by measuring the
time we spend in the {{Restarting}} state.
However, if we stick to the definition of upTime as time spent in RUNNING, then
we will end up with a gap for the time spent in WaitingForResources.
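In the same simplified style as above (state and class names are again made up),
restartingTime for the adaptive scheduler could be derived from its state-machine
transitions, which also makes the WaitingForResources gap visible:
{code:java}
// Simplified sketch, not the actual adaptive scheduler code.
enum SketchSchedulerState { EXECUTING, RESTARTING, WAITING_FOR_RESOURCES }

final class RestartingTimeSketch {
    private long restartingSinceMillis = -1;
    private long lastRestartingTimeMillis = 0;

    void onStateChange(SketchSchedulerState newState, long nowMillis) {
        if (newState == SketchSchedulerState.RESTARTING) {
            restartingSinceMillis = nowMillis;
        } else if (restartingSinceMillis >= 0) {
            // Leaving Restarting stops the timer. If upTime only starts once
            // the job is RUNNING again, any time spent in WaitingForResources
            // after this point is counted by neither metric: that is the gap.
            lastRestartingTimeMillis = nowMillis - restartingSinceMillis;
            restartingSinceMillis = -1;
        }
    }

    // Duration of the most recent restart, as the gauge would report it.
    long getRestartingTimeMillis() {
        return lastRestartingTimeMillis;
    }
}
{code}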
h4. downTime
This metric describes the time it takes the job to transition from FAILING back
to RUNNING.
It is meant as a measure of how long recovery takes in case of a job failure.
You may be wondering what the difference between {{downTime}} and
{{restartingTime}} is meant to be. Unfortunately I do not have the answer to
that.
Presumably, at the time they were added, they covered different parts of
the recovery process, but since we never documented these steps explicitly, the
exact semantics are no longer clear and there is no spec that a scheduler can
follow.
Furthermore, this metric is currently broken because a FAILING job _never_
transitions into RUNNING anymore.
The default scheduler transitions from RUNNING -> RESTARTING -> RUNNING,
whereas the adaptive scheduler cancels the job and creates a new EG.
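A minimal sketch of the FAILING-keyed computation (made-up names again, reusing
the {{SketchJobStatus}} enum from the upTime sketch above) shows why the gauge
can no longer update under these transitions:
{code:java}
// Simplified sketch of the historical downTime semantics, not the real code.
final class DownTimeSketch {
    private long failingSinceMillis = -1;
    private long lastDownTimeMillis = 0;

    void onStatusChange(SketchJobStatus newStatus, long nowMillis) {
        if (newStatus == SketchJobStatus.FAILING) {
            failingSinceMillis = nowMillis;
        } else if (newStatus == SketchJobStatus.RUNNING && failingSinceMillis >= 0) {
            // This branch is never taken anymore: a FAILING job no longer
            // transitions back to RUNNING (the default scheduler goes
            // RUNNING -> RESTARTING -> RUNNING, the adaptive scheduler cancels
            // the job and builds a new EG), so the gauge never updates.
            lastDownTimeMillis = nowMillis - failingSinceMillis;
            failingSinceMillis = -1;
        }
    }

    long getDownTimeMillis() {
        return lastDownTimeMillis;
    }
}
{code}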
As it stands, we could probably merge downTime and restartingTime, because they
seem to cover exactly the same thing.
> Rethink up-/down-/restartingTime metrics
> ----------------------------------------
>
> Key: FLINK-21513
> URL: https://issues.apache.org/jira/browse/FLINK-21513
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Metrics
> Reporter: Chesnay Schepler
> Priority: Major
> Fix For: 1.13.0
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)