[ 
https://issues.apache.org/jira/browse/YUNIKORN-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YUNIKORN-646.
----------------------------------
    Fix Version/s: 0.11
       Resolution: Fixed

> Add metrics implementation: "allocating_latency_seconds"
> --------------------------------------------------------
>
>                 Key: YUNIKORN-646
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-646
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: core - common
>            Reporter: Chenya Zhang
>            Assignee: Chenya Zhang
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.11
>
>
> Observation:
>  # The container allocation latency stays at 0, while the number of 
> allocation attempts fluctuates normally.
>  # Scheduler metric definitions are inconsistent and sometimes hard to 
> understand.
> Root cause analysis:
>  # The metric "allocating_latency_seconds" is not fully implemented, or its 
> implementation was dropped in recent releases. For example, 
> ObserveSchedulingLatency() is currently not called when allocating containers.
>  # Scheduler metrics were implemented by multiple developers over time 
> without following a common convention.
> Improvement Plan:
>  # The top-level container allocation latency can be captured by the main 
> scheduling routine in {{scheduler/context.go}}. Reason: the {{schedule()}} 
> method in {{scheduler/context.go}} is the entry point that processes each 
> partition in the scheduler, walking over each queue and app to check whether 
> anything can be scheduled.
>  # The metric name "allocating_latency_seconds" can be changed to 
> "scheduling_latency_seconds". Reason: the metric is initially defined as 
> "schedulingLatency" in {{metrics/scheduler.go}}. Consistent naming helps 
> avoid confusion.
>  # Other metric definitions and help messages can be improved to make 
> {{metrics/scheduler.go}} consistent. (I am open to creating a separate PR 
> for the refactoring work.)
>  # New metrics can be added later to monitor lower-level latency while the 
> scheduler iterates over the partition list, queues, applications, requests, 
> etc. Not included in this PR.



