Re: YuniKorn Metrics

Tao Yang Thu, 15 Apr 2021 08:49:39 -0700

Hi, Chaoran

Sorry for the late response. Yes, we have done some performance tests
and found that the scheduling process was far from transparent at the
beginning. Just as you said, the internal metrics are not good enough for us
to spot issues or locate bottlenecks. So we have explored more
approaches to improve the visibility of the scheduling process, as follows:
1) Broaden the horizon: the scheduling process is just one part of the pod
lifecycle. We want to know all the phases in the pod lifecycle and exactly
where the biggest bottleneck is. We did find some much bigger bottlenecks
elsewhere, in the APIServer, some CNI/CSI services, or the Kubelet, by
monitoring and parsing all key times (e.g.
create/scheduled/started/initialized/ready/containers-ready times) from
every Pod, aggregating the data, and showing it in Grafana charts.
This helps a lot to locate bottlenecks quickly across the whole pod lifecycle.
2) Dig into more details: use an existing tracing framework (e.g. OpenTracing)
to collect tracing information in a standardized format for scheduling and
resource management. The traces follow the time and space sequence
of the scheduling process and can be collected periodically or on demand to
help spot issues. Please refer to YUNIKORN-387 for details; Weihao
Zheng will keep working on this feature.
3) We also developed a simple profiling tool which can easily be injected
anywhere and gives a statistical report periodically or on demand, so
that we can clearly see the performance details of any process.

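For 1), the key times can be read from each Pod object. The sketch below is
only an illustration under assumptions, not our monitoring code: it takes a
pod as a dict (e.g. parsed from "kubectl get pod -o json"), reads
metadata.creationTimestamp, status.startTime and the lastTransitionTime of
the PodScheduled/Initialized/ContainersReady/Ready conditions, and returns
the seconds spent between consecutive milestones. Treating status.startTime
as the "started" time, the milestone order, and the function names are my
assumptions.

```python
from datetime import datetime, timezone

# Assumed order in which a pod normally reaches each milestone.
MILESTONES = ["created", "scheduled", "started", "initialized",
              "containers-ready", "ready"]

# Pod condition types mapped to the milestone names used above.
CONDITION_TO_MILESTONE = {
    "PodScheduled": "scheduled",
    "Initialized": "initialized",
    "ContainersReady": "containers-ready",
    "Ready": "ready",
}


def parse_ts(ts):
    # Pod timestamps are UTC with second resolution, e.g. 2021-04-15T08:49:39Z.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def key_times(pod):
    # Collect every milestone timestamp that is present on the pod.
    times = {"created": parse_ts(pod["metadata"]["creationTimestamp"])}
    status = pod.get("status", {})
    if "startTime" in status:
        times["started"] = parse_ts(status["startTime"])
    for cond in status.get("conditions", []):
        name = CONDITION_TO_MILESTONE.get(cond.get("type"))
        if name and cond.get("status") == "True":
            times[name] = parse_ts(cond["lastTransitionTime"])
    return times


def phase_durations(pod):
    # Seconds between consecutive milestones the pod has actually reached.
    times = key_times(pod)
    reached = [m for m in MILESTONES if m in times]
    return {
        f"{a}->{b}": (times[b] - times[a]).total_seconds()
        for a, b in zip(reached, reached[1:])
    }
```

Aggregating these per-phase durations over many pods (for example as
percentiles per phase) is what makes the slowest phase stand out in a chart.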

Hope this helps. Thanks.

Regards,
Tao

Chaoran Yu <[email protected]> wrote on Thu, Apr 15, 2021 at 4:04 AM:

> Hello Tao,
>
> During our discussion with Wilfred yesterday, he mentioned that you folks
> at Alibaba have been running YuniKorn at some decent scale. We are also
> trying some big workloads (Spark batch jobs) with YuniKorn and would like
> to have better visibility in terms of the scheduling performance, and also
> create alerts to help us spot issues as soon as they happen. We found that
> the current list of metrics available in the core is not
> comprehensive and some seem to be incorrectly computed. We are reaching out
> to kindly ask what metrics you have found to be most helpful. Or did
> you add some new metrics? A more generic question is: how have you been
> monitoring YuniKorn? Many thanks in advance.
>
> If anyone else on the mailing list has ideas to chime in, that would be
> awesome too.
>
> Regards,
> Chaoran
>
