Hi Chaoran,

Sorry for the late response.

Yes, we have done some performance tests and found that the scheduling process was far from transparent at the beginning. Just as you said, the internal metrics are not good enough for us to spot issues or locate bottlenecks. So we have explored more approaches to improve the visibility of the scheduling process, as follows:

1) Broaden the horizon: the scheduling process is just one part of the pod lifecycle. We want to know all the phases in the pod lifecycle and know exactly where the biggest bottleneck is. We did find some much bigger bottlenecks elsewhere, in the APIServer, some CNI/CSI services, or the Kubelet, by monitoring and parsing all key times (e.g. create/scheduled/started/initialized/ready/containers-ready times) out of every Pod, aggregating the data, and showing it in Grafana charts. This helps a lot to locate bottlenecks quickly across the whole pod lifecycle.

2) Dig into more details: use an existing tracing framework (e.g. OpenTracing) to collect tracing information in a standardized format for scheduling and resource management. The traces follow the time and space sequence of the scheduling process, and can be collected periodically or on demand to help spot issues. Please refer to YUNIKORN-387 for details; Weihao Zheng will keep working on this feature.

3) We also developed a simple profiling tool which can easily be injected anywhere and gives a statistical report periodically or on demand, so that we can clearly see the performance details of any process.
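As an illustration of approach 1), a minimal sketch (not the tooling actually used at Alibaba) could look like the following. It assumes each Pod is available as the JSON object returned by the Kubernetes API, and that the "key times" can be read from `metadata.creationTimestamp`, `status.startTime`, and the `lastTransitionTime` of the standard Pod conditions. The function names, the "Started" label, and the sample Pod are hypothetical. Note that `lastTransitionTime` only reflects the most recent transition, so a Pod whose readiness has flapped would report a later time than its first ready time.

```python
from datetime import datetime


def parse_ts(ts):
    # Kubernetes serializes these timestamps as RFC 3339 in UTC,
    # e.g. "2021-04-15T04:00:02Z" (second precision assumed here).
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def pod_phase_latencies(pod):
    """Return seconds elapsed from Pod creation to each key time."""
    created = parse_ts(pod["metadata"]["creationTimestamp"])
    status = pod.get("status", {})
    times = {}
    # Only conditions that are currently True mark a reached phase.
    for cond in status.get("conditions", []):
        if cond.get("status") == "True":
            times[cond["type"]] = parse_ts(cond["lastTransitionTime"])
    # startTime is when the Kubelet acknowledged the Pod.
    if status.get("startTime"):
        times["Started"] = parse_ts(status["startTime"])
    return {name: (t - created).total_seconds() for name, t in times.items()}


def summarize(pods):
    """Aggregate per-phase latencies over many Pods (count, mean, max)."""
    per_phase = {}
    for pod in pods:
        for name, secs in pod_phase_latencies(pod).items():
            per_phase.setdefault(name, []).append(secs)
    return {
        name: {"count": len(v), "mean": sum(v) / len(v), "max": max(v)}
        for name, v in per_phase.items()
    }


# Hypothetical sample Pod, trimmed to the fields used above.
SAMPLE_POD = {
    "metadata": {"creationTimestamp": "2021-04-15T04:00:00Z"},
    "status": {
        "startTime": "2021-04-15T04:00:03Z",
        "conditions": [
            {"type": "PodScheduled", "status": "True",
             "lastTransitionTime": "2021-04-15T04:00:02Z"},
            {"type": "Initialized", "status": "True",
             "lastTransitionTime": "2021-04-15T04:00:05Z"},
            {"type": "ContainersReady", "status": "True",
             "lastTransitionTime": "2021-04-15T04:00:09Z"},
            {"type": "Ready", "status": "True",
             "lastTransitionTime": "2021-04-15T04:00:09Z"},
        ],
    },
}

if __name__ == "__main__":
    print(summarize([SAMPLE_POD]))
```

The per-phase summaries could then be exported to whatever backend feeds the Grafana charts. A large gap between "PodScheduled" and "Ready" compared with creation to "PodScheduled" would point away from the scheduler and toward the Kubelet or CNI/CSI side.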
Hope this helps. Thanks.

Regards,
Tao

On Thu, Apr 15, 2021 at 4:04 AM, Chaoran Yu <[email protected]> wrote:

> Hello Tao,
>
> During our discussion with Wilfred yesterday, he mentioned that you folks
> at Alibaba have been running YuniKorn at some decent scale. We are also
> trying some big workloads (Spark batch jobs) with YuniKorn and would like
> to have better visibility in terms of the scheduling performance, and also
> create alerts to help us spot issues as soon as they happen. We found that
> the current list of metrics that are available in the core are not
> comprehensive and some seem to be incorrectly computed. We are reaching out
> to kindly ask you what metrics you have found to be most helpful? Or did
> you add some new metrics? A more generic question is how have you been
> monitoring YuniKorn? Many thanks in advance.
>
> If anyone else on the mailing list has ideas to chime in, that would be
> awesome too.
>
> Regards,
> Chaoran
