[
https://issues.apache.org/jira/browse/YUNIKORN-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17092476#comment-17092476
]
Tao Yang commented on YUNIKORN-21:
----------------------------------
Thanks [~wangda] for the review.
My replies to these questions are as follows:
{quote}
1) What is the difference between node-sorting policy and evaluator? If the
node-order is solely based on the evaluator, we can use one single component to
replace the two? (Or we only expose sorter interface and the evaluator becomes
implementation details).
{quote}
The node-sorting policy stays as before: it only defines two policies (Fairness
and BinPacking) without implementations. The evaluator evaluates nodes and
gives scores in different dimensions. We can have a simple evaluator as before
(one score whose value is the used/capacity ratio) or a more complex evaluator
based on multiple scores and weights. I think we can abstract some common
components out of the algorithms so that they are loosely coupled and can be
reused across different algorithms.
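To make the simple versus weighted evaluator concrete, here is a minimal sketch
in Go. All names in it (Node, Scorer, UsageRatioScorer, WeightedScorer,
NodeEvaluator.Evaluate) are hypothetical and only illustrate the idea; they are
not the interfaces from the design doc or the YuniKorn code base.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of a scheduling node.
type Node struct {
	Name     string
	Used     map[string]float64 // used quantity per resource type
	Capacity map[string]float64 // total quantity per resource type
}

// Scorer is a hypothetical interface: it rates a node in one dimension.
type Scorer interface {
	Score(n *Node) float64
}

// UsageRatioScorer returns used/capacity for a single resource type.
type UsageRatioScorer struct{ Resource string }

func (s UsageRatioScorer) Score(n *Node) float64 {
	capacity := n.Capacity[s.Resource]
	if capacity == 0 {
		return 0 // the node does not offer this resource
	}
	return n.Used[s.Resource] / capacity
}

// WeightedScorer pairs a scorer with its configured weight.
type WeightedScorer struct {
	Scorer Scorer
	Weight float64
}

// NodeEvaluator sums weight*score over all configured scorers.
// With a single scorer of weight 1 it behaves like the simple evaluator.
type NodeEvaluator struct{ Scorers []WeightedScorer }

func (e NodeEvaluator) Evaluate(n *Node) float64 {
	total := 0.0
	for _, ws := range e.Scorers {
		total += ws.Weight * ws.Scorer.Score(n)
	}
	return total
}

func main() {
	node := &Node{
		Name:     "node-1",
		Used:     map[string]float64{"vcore": 16, "gpu": 2},
		Capacity: map[string]float64{"vcore": 32, "gpu": 8},
	}
	simple := NodeEvaluator{Scorers: []WeightedScorer{
		{Scorer: UsageRatioScorer{Resource: "vcore"}, Weight: 1},
	}}
	weighted := NodeEvaluator{Scorers: []WeightedScorer{
		{Scorer: UsageRatioScorer{Resource: "vcore"}, Weight: 1},
		{Scorer: UsageRatioScorer{Resource: "gpu"}, Weight: 3},
	}}
	fmt.Println(simple.Evaluate(node), weighted.Evaluate(node))
}
```

A sorting algorithm would then only depend on the evaluator's single result,
which is what keeps the two components loosely coupled in this sketch.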
{quote}
2) The scope of incremental sorting algorithm is not very clear to me, are we
going to maintain a sorted list for every request? It might be too much if we
want to do it on a per-request basis (we could have many different requests).
{quote}
We are not going to maintain a sorted list for every request. We can maintain
one common sorted list per partition, which would be enough for most requests,
namely those without specific node requirements (like affinity, anti-affinity,
best-fit). For requests with specific node requirements, my initial thought is
to handle them with different strategies: if only a few nodes need to be
adjusted, we can take the common sorted list and rearrange those nodes;
otherwise we can directly re-sort the nodes based on a merged score (might be
commonScore + dynamicScore). We can discuss this further.
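The two strategies could look roughly like the Go sketch below. It is only an
illustration under assumptions: ScoredNode, sortForRequest, the dynamic score
map and the threshold parameter are hypothetical names, and the sketch assumes
that a higher score means the node is tried first.

```go
package main

import (
	"fmt"
	"sort"
)

// ScoredNode is a hypothetical entry of the per-partition sorted list.
type ScoredNode struct {
	Name  string
	Score float64 // commonScore, shared by all requests
}

// sortForRequest returns a request-specific ordering, highest score first.
// common must already be sorted by Score in descending order and is not
// modified. dynamic holds the dynamicScore adjustments for this request.
// threshold is an assumed tuning knob: up to this many adjusted nodes are
// repositioned individually, otherwise the whole list is re-sorted.
func sortForRequest(common []ScoredNode, dynamic map[string]float64, threshold int) []ScoredNode {
	if len(dynamic) > threshold {
		// Many adjustments: re-sort everything by commonScore + dynamicScore.
		merged := make([]ScoredNode, len(common))
		for i, n := range common {
			merged[i] = ScoredNode{Name: n.Name, Score: n.Score + dynamic[n.Name]}
		}
		sort.SliceStable(merged, func(i, j int) bool {
			return merged[i].Score > merged[j].Score
		})
		return merged
	}
	// Few adjustments: keep the untouched nodes in their existing order.
	result := make([]ScoredNode, 0, len(common))
	var adjusted []ScoredNode
	for _, n := range common {
		if d, ok := dynamic[n.Name]; ok {
			adjusted = append(adjusted, ScoredNode{Name: n.Name, Score: n.Score + d})
		} else {
			result = append(result, n)
		}
	}
	// Re-insert each adjusted node at its new position via binary search.
	for _, a := range adjusted {
		i := sort.Search(len(result), func(i int) bool {
			return result[i].Score < a.Score
		})
		result = append(result, ScoredNode{})
		copy(result[i+1:], result[i:])
		result[i] = a
	}
	return result
}

func main() {
	common := []ScoredNode{{"a", 9}, {"b", 6}, {"c", 3}}
	// Node "c" gets a bonus for this request, e.g. because of affinity.
	for _, n := range sortForRequest(common, map[string]float64{"c": 5}, 1) {
		fmt.Println(n.Name, n.Score)
	}
}
```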
{quote}
3) I'm not quite sure about the "Weight" concept. Are we going to support a
multi-node scorer like the K8s default scheduler? I personally don't prefer
that way, since the behavior of a weighted result is not easy to explain.
{quote}
Yes, that's what I had in mind. I agree that it may not be easy to explain, but
I think it is flexible enough for complex scenarios. For example, we may need
to treat the GPU resource as the most important factor for requests with GPU
requirements. A weight is an easy way to define which factor is more important,
and the configured value might differ between clusters. I would be glad to
adopt a better approach if anyone has one.
{quote}
4) How fast can we do the resorting? Since the node list keeps changing and
node status also changes fast, are we going to keep an always up-to-date
sorting result, or will we have some latency? (If we need pre-sorted node lists
on a per-request basis, there are too many sorted node lists to maintain.)
{quote}
As the doc describes, we can have multiple algorithms, such as
DefaultNodeSortingAlgorithm (sorts all nodes every time) and
IncrementalNodeSortingAlgorithm (sorts updated nodes incrementally). Users can
choose which one to use or easily add customized algorithms themselves.
IncrementalNodeSortingAlgorithm can leverage SubjectManager to keep track of
the updated nodes and a cached sorting result instead of an always up-to-date
sorting result (as shown in the first sequence diagram in the doc). The
resorting only happens when the scheduler is actually allocating for a specific
request. The scheduling interval should be tiny and the updated nodes should be
few most of the time, so the resorting can be quite fast for common requests.
In my local test, the scheduling throughput improved from 450 to 5000+ in a
mock cluster with 1000 nodes, according to the benchmark results of
scheduler_perf_test.go.
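The incremental idea can be sketched in Go as follows. This is not the
IncrementalNodeSortingAlgorithm or SubjectManager from the doc; the
IncrementalSorter type and its methods are hypothetical, and the sketch assumes
that a higher score sorts first. Updates only mark a node as changed; the
cached order is repaired lazily when the sorted list is requested.

```go
package main

import (
	"fmt"
	"sort"
)

// IncrementalSorter is a hypothetical sketch: it caches a sorted node list
// and only repositions nodes that were updated since the last sort.
type IncrementalSorter struct {
	scores  map[string]float64 // latest score per node
	sorted  []string           // cached order, highest score first
	updated map[string]bool    // nodes changed since the last Sorted call
}

func NewIncrementalSorter() *IncrementalSorter {
	return &IncrementalSorter{
		scores:  make(map[string]float64),
		updated: make(map[string]bool),
	}
}

// Update records a new score for a node. No sorting happens here.
func (s *IncrementalSorter) Update(name string, score float64) {
	s.scores[name] = score
	s.updated[name] = true
}

// Sorted is meant to be called at allocation time. It removes the updated
// nodes from the cached list, re-inserts each one by binary search, and
// returns a copy of the repaired order.
func (s *IncrementalSorter) Sorted() []string {
	if len(s.updated) > 0 {
		kept := s.sorted[:0]
		for _, name := range s.sorted {
			if !s.updated[name] {
				kept = append(kept, name)
			}
		}
		s.sorted = kept
		for name := range s.updated {
			score := s.scores[name]
			i := sort.Search(len(s.sorted), func(i int) bool {
				return s.scores[s.sorted[i]] < score
			})
			s.sorted = append(s.sorted, "")
			copy(s.sorted[i+1:], s.sorted[i:])
			s.sorted[i] = name
		}
		s.updated = make(map[string]bool)
	}
	return append([]string(nil), s.sorted...)
}

func main() {
	s := NewIncrementalSorter()
	s.Update("n1", 0.2)
	s.Update("n2", 0.8)
	s.Update("n3", 0.5)
	fmt.Println(s.Sorted())
	s.Update("n1", 0.9) // only n1 is repositioned on the next call
	fmt.Println(s.Sorted())
}
```

With few updated nodes per scheduling interval, each call costs one pass over
the cached list plus one binary-search insert per updated node, instead of a
full sort.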
{quote}
1/2 are not request-related, 3 is request-related, I'm wondering how we deal
with these different use cases based on the proposal.
{quote}
For 1/2, users can leverage IncrementalNodeSortingAlgorithm together with a
NodeEvaluator that flexibly defines multiple static scorers and their weights,
to get better performance. For 3, sorting all nodes for each request seems
unavoidable, so DefaultNodeSortingAlgorithm and a NodeEvaluator with one
dynamic scorer (which calculates the fix score) should be enough. Does that
make sense?
{quote}
Also, it will be important to make sure the node sorting policy can be used by
preemption logic.
{quote}
I agree with this.
> Revisit node sorting algorithm for fairness
> -------------------------------------------
>
> Key: YUNIKORN-21
> URL: https://issues.apache.org/jira/browse/YUNIKORN-21
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Wangda Tan
> Priority: Major
> Attachments: Improve node sorting algorithm v1.pdf, Improve node
> sorting algorithm v2.pdf
>
>
> Currently, we're using DominantRatio for the node sorting algorithm
> {code:java}
> func CompUsageShares(left, right *Resource) int {
>     lshares := getShares(left, nil)
>     rshares := getShares(right, nil)
>     return compareShares(lshares, rshares)
> }{code}
> This is not good, for two reasons:
> # Dominant resource comparison is about 8X more expensive than a single float
> comparison for two resource types.
> # Dominant resource comparison is not stable when we have scarce resource
> types like GPU. Given a node with 192GB mem, 32 vcores, and 1 GPU available
> and another with 168GB mem, 64 vcores, and 8 GPUs available, the former can
> go first because of the following logic:
> {code:java}
> if total == nil || total.Resources[k] == 0 {
>     // negative share is logged
>     if v < 0 {
>         log.Logger().Debug("usage is negative no total, share is also negative",
>             zap.Int64("resource quantity", int64(v)))
>     }
>     shares[idx] = float64(v)
>     idx++
>     continue
> }{code}
> I think we should discard dominant resource comparison for node resources.
> Instead, we can just use one resource type (like vcores) to compare available
> resources.