zentol opened a new pull request #18566:
URL: https://github.com/apache/flink/pull/18566


   FLINK-23976 added standardized metrics for capturing how much time we spend 
in each JobStatus. However, certain states in practice consist of several 
stages; for example the RUNNING state also includes the deployment of tasks.
   
   To get a better picture of where time is spent, I propose adding new metrics 
that capture the deployingTime based on the execution states. This will 
additionally get us closer to a proper uptime metric, which ideally will be 
runningTime minus the various stage time metrics.
   
   A job is considered to be deploying:
   
   - for batch jobs, if no task is running and at least one task is being deployed
   - for streaming jobs, if at least one task is being deployed
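   
   As a rough illustration of these semantics, the check could be sketched as below. This is not the code from the PR: the class `DeployingStateSketch`, the enums `JobType` and `TaskState`, and the method `isDeploying` are hypothetical, simplified stand-ins chosen for this example.

```java
import java.util.Collection;

/** Hypothetical sketch; all names here are illustrative and not taken from the PR. */
final class DeployingStateSketch {

    /** Simplified stand-in for the job's execution mode. */
    enum JobType { BATCH, STREAMING }

    /** Simplified stand-in for per-task execution states. */
    enum TaskState { CREATED, SCHEDULED, DEPLOYING, RUNNING, FINISHED }

    /**
     * Returns true if the job counts as "deploying" under the proposed semantics:
     * batch jobs need at least one deploying task and no running task,
     * streaming jobs only need at least one deploying task.
     */
    static boolean isDeploying(JobType jobType, Collection<TaskState> taskStates) {
        boolean anyDeploying = taskStates.contains(TaskState.DEPLOYING);
        boolean anyRunning = taskStates.contains(TaskState.RUNNING);
        if (jobType == JobType.BATCH) {
            // A running batch task already makes progress, so the job is no longer "deploying".
            return anyDeploying && !anyRunning;
        }
        // A streaming job is held back as long as any task is still being deployed.
        return anyDeploying;
    }
}
```

   With this sketch, a job with one running and one deploying task counts as deploying when it is a streaming job, but not when it is a batch job.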
   
   The semantics are different for batch/streaming jobs because they differ in 
terms of how they make progress. For a streaming job, all tasks need to be 
deployed for checkpointing to work. For batch jobs, any deployed task 
immediately starts progressing the job.
   
   
   I will add documentation later once we have agreed on the semantics.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


