[
https://issues.apache.org/jira/browse/YUNIKORN-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17092781#comment-17092781
]
Wangda Tan commented on YUNIKORN-21:
------------------------------------
Thanks [~Tao Yang], the answer looks reasonable to me, and the number also
looks very promising.
bq. The scheduling throughput can be improved from 450 to 5000+ in a mock
cluster with 1000 nodes according to the benchmark results of
scheduler_perf_test.go in my local test.
I reread the design doc and think I understand it better now. At a high level,
the class design makes sense. If you can put together a PoC patch, I can help
review it and give more detailed suggestions.
On second thought, I think the weighted sort policy may make sense if the
number of scorers involved is small (no more than 3). I want to avoid having
20 scorers involved and producing a weighted score that we cannot explain at all.
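
To illustrate the point about keeping the scorer count small, here is a minimal sketch of a weighted sort policy that rejects more than 3 scorers. All names here (Node, Scorer, FreeFraction, NewWeightedPolicy, WeightedScore, SortNodes) are hypothetical and are not taken from the design doc or the YuniKorn code base; the per-resource "free fraction" input is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Node is a hypothetical, simplified view of a node.
// Available holds the free fraction (0..1) of each resource type.
type Node struct {
	Name      string
	Available map[string]float64
}

// Scorer is a hypothetical named scoring function with a weight.
type Scorer struct {
	Name   string
	Weight float64
	Score  func(n Node) float64
}

// maxScorers caps the policy so the final score stays explainable.
const maxScorers = 3

// FreeFraction builds a scorer that returns the free fraction of one resource.
func FreeFraction(resource string, weight float64) Scorer {
	return Scorer{
		Name:   resource,
		Weight: weight,
		Score:  func(n Node) float64 { return n.Available[resource] },
	}
}

// NewWeightedPolicy validates the scorer count before accepting the policy.
func NewWeightedPolicy(scorers ...Scorer) ([]Scorer, error) {
	if len(scorers) == 0 || len(scorers) > maxScorers {
		return nil, errors.New("weighted policy needs between 1 and 3 scorers")
	}
	return scorers, nil
}

// WeightedScore sums weight * score over all scorers for one node.
func WeightedScore(scorers []Scorer, n Node) float64 {
	total := 0.0
	for _, s := range scorers {
		total += s.Weight * s.Score(n)
	}
	return total
}

// SortNodes orders nodes by weighted score, highest first.
func SortNodes(scorers []Scorer, nodes []Node) {
	sort.SliceStable(nodes, func(i, j int) bool {
		return WeightedScore(scorers, nodes[i]) > WeightedScore(scorers, nodes[j])
	})
}

func main() {
	policy, err := NewWeightedPolicy(FreeFraction("vcore", 0.5), FreeFraction("memory", 0.5))
	if err != nil {
		panic(err)
	}
	nodes := []Node{
		{Name: "node-b", Available: map[string]float64{"vcore": 0.25, "memory": 0.25}},
		{Name: "node-a", Available: map[string]float64{"vcore": 0.5, "memory": 0.25}},
	}
	SortNodes(policy, nodes)
	for _, n := range nodes {
		fmt.Println(n.Name, WeightedScore(policy, n))
	}
}
```

With two or three scorers, each node's score can still be broken down term by term when someone asks why a node was picked; that breakdown is what gets lost with 20 scorers.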
> Revisit node sorting algorithm for fairness
> -------------------------------------------
>
> Key: YUNIKORN-21
> URL: https://issues.apache.org/jira/browse/YUNIKORN-21
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Wangda Tan
> Priority: Major
> Attachments: Improve node sorting algorithm v1.pdf, Improve node
> sorting algorithm v2.pdf
>
>
> Currently, we're using DominantRatio for the node sorting algorithm
> {code:java}
> func CompUsageShares(left, right *Resource) int {
>     lshares := getShares(left, nil)
>     rshares := getShares(right, nil)
>     return compareShares(lshares, rshares)
> }{code}
> This is not good, for two reasons:
> # Dominant resource comparison is about 8X more expensive than a single float
> comparison for two resource types.
> # Dominant resource comparison is not stable when we have scarce resource types
> like GPU. Take a node with 192GB mem, 32 vcores, and 1 GPU available, compared to
> one with 168GB mem, 64 vcores, and 8 GPUs available: the former can go first
> because of the following logic:
> {code:java}
> if total == nil || total.Resources[k] == 0 {
>     // negative share is logged
>     if v < 0 {
>         log.Logger().Debug("usage is negative no total, share is also negative",
>             zap.Int64("resource quantity", int64(v)))
>     }
>     shares[idx] = float64(v)
>     idx++
>     continue
> }{code}
> I think we should discard dominant resource comparison for node resources.
> Instead, we should just use one resource type (like vcores) to compare
> available resources.
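
The two-node example from the description can be reproduced with a small sketch. This is a simplified stand-in, not the actual YuniKorn implementation: Resource, rawShares, compareDominant and compareSingle are hypothetical names. It assumes that without a total each share is the raw quantity (as in the quoted branch) and that shares are compared largest first.

```go
package main

import (
	"fmt"
	"sort"
)

// Resource is a simplified, hypothetical stand-in for the scheduler's resource type.
type Resource map[string]int64

// rawShares mimics the "no total" branch: each share is the raw quantity.
// The result is sorted largest first.
func rawShares(r Resource) []float64 {
	shares := make([]float64, 0, len(r))
	for _, v := range r {
		shares = append(shares, float64(v))
	}
	sort.Sort(sort.Reverse(sort.Float64Slice(shares)))
	return shares
}

// compareDominant compares two resources share by share, largest share first.
// It returns 1 if left is larger, -1 if right is larger, 0 if equal.
func compareDominant(left, right Resource) int {
	l, r := rawShares(left), rawShares(right)
	for i := 0; i < len(l) && i < len(r); i++ {
		switch {
		case l[i] > r[i]:
			return 1
		case l[i] < r[i]:
			return -1
		}
	}
	// All shared positions are equal: the longer list wins.
	switch {
	case len(l) > len(r):
		return 1
	case len(l) < len(r):
		return -1
	}
	return 0
}

// compareSingle compares only one named resource type, e.g. "vcore".
func compareSingle(left, right Resource, name string) int {
	switch {
	case left[name] > right[name]:
		return 1
	case left[name] < right[name]:
		return -1
	}
	return 0
}

func main() {
	// Available resources of the two nodes from the description.
	first := Resource{"memory": 192, "vcore": 32, "gpu": 1}
	second := Resource{"memory": 168, "vcore": 64, "gpu": 8}

	fmt.Println(compareDominant(first, second))        // prints 1: memory alone decides
	fmt.Println(compareSingle(first, second, "vcore")) // prints -1: second node has more vcores
}
```

In the raw-share comparison the largest number (memory) decides the order, so the node with fewer vcores and GPUs is treated as having more available. Comparing a single resource type is one float comparison and always gives the same answer for the same inputs.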