[
https://issues.apache.org/jira/browse/YUNIKORN-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871475#comment-17871475
]
Craig Condit commented on YUNIKORN-2678:
----------------------------------------
I also hadn't looked at this code in a very long time, but I have been working
closely with Paul to help develop the new solution. After combing through it,
I'm convinced that the entire shares[] comparison logic is utterly broken by
design. There's simply no way to get consistent results out of it: the shares
are not returned in a consistent order, and the first value found (which could
be pods, memory, CPU, ephemeral storage, or really anything else) becomes the
sort key. In Paul's example it was ephemeral storage, so the queue with less
usage along that dimension always sorted first.
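The problem can be illustrated with a small sketch (a hypothetical Python simplification of the comparison, not the actual Go code in sorters.go): when only the first share value encountered decides the ordering, the iteration order of the resource types flips the result.

```python
# Hypothetical simplification of the current comparison: the first
# differing share value decides the whole sort, so the (unstable)
# resource iteration order determines the outcome.
def first_share_compare(shares_a, shares_b, resource_order):
    for res in resource_order:
        a, b = shares_a.get(res, 0.0), shares_b.get(res, 0.0)
        if a != b:
            return a - b  # first differing value decides everything
    return 0.0

q1 = {"memory": 0.9, "ephemeral-storage": 0.1}
q2 = {"memory": 0.2, "ephemeral-storage": 0.8}

# Same data, opposite orderings depending on which resource comes first.
by_memory = first_share_compare(q1, q2, ["memory", "ephemeral-storage"])
by_storage = first_share_compare(q1, q2, ["ephemeral-storage", "memory"])
```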
The logic currently also falls back to returning an absolute resource value in
the case where no guarantee is set. This means one queue returns a fractional
value (typically 0 to 1, though it can be larger if the queue is overcommitted),
while a lack of guarantee for that resource in another queue returns the total
usage instead (which is almost certainly orders of magnitude larger than a
fraction). Even 2 bytes of RAM or 2 millicores will sort as larger. It's simply
nonsensical to compare fractional usage values with absolute units.
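To make the unit mismatch concrete, here is a sketch of the fallback behavior (illustrative numbers and names, not the real implementation):

```python
# Sketch of the fallback described above: with a guarantee, the share
# is usage/guarantee (a fraction); without one, the comparison falls
# back to the raw usage in absolute units.
def share(usage, guarantee=None):
    if guarantee:
        return usage / guarantee  # fractional, typically 0..1
    return usage  # absolute units: bytes, millicores, ...

GB = 10**9
heavy = share(900 * GB, guarantee=1000 * GB)  # 900GB of a 1TB guarantee
light = share(2)                              # 2 bytes, no guarantee

# Two bytes of RAM sort as "more loaded" than 900GB of a guaranteed TB.
```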
{quote}Introducing max in the mix with guarantee will have side effects. I
create a queue with max memory set to 1TB, no guaranteed. I create a second
queue with max set to 1TB but a guaranteed memory of 100GB. Both queues use
50GB. In that case share of queue 1 will be 0.05, queue 2 will have a share of
0.5 Queue 1 will win and get scheduled until it uses 500GB, with a guaranteed
of 0. Queue 1 should not have a smaller share than queue 2 until all guaranteed
is used.
That looks as broken as what we have now.
{quote}
I disagree. Treating lack of (specific) guarantee as == max aligns with how
this is already treated for the purposes of preemption. In that case, if no
guarantee is supplied, the guarantee is treated as unlimited (or equal to max
if you prefer). The result is that lack of guarantee doesn't allow a queue to
either trigger preemption or *be preempted itself* (which is the part that
works like guaranteed implicitly equalling max).
The proposed algorithm for fair sorting works similarly, and if you take your
example using those semantics, it actually makes sense. Queue 2 has a guarantee
of 100GB, while Queue 1 has an *implicit* guarantee of 1TB. Both have *implicit*
guarantees of _any other resource type not mentioned_ as well. This ensures
that even if guaranteed resources are not supplied (which is the default
configuration), fair scheduling still works consistently.
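Applying those semantics to the quoted example can be sketched as follows (an assumption based on this discussion, not the committed implementation):

```python
# Sketch of the proposed semantics: a missing guarantee is treated as
# equal to the queue's max before computing the fair share.
def fair_share(usage, guaranteed, maximum):
    effective = guaranteed if guaranteed is not None else maximum
    return usage / effective

GB = 10**9
TB = 1000 * GB

# Quoted example: both queues use 50GB and have a max memory of 1TB;
# only the second queue guarantees 100GB.
queue1 = fair_share(50 * GB, None, TB)      # implicit guarantee = max
queue2 = fair_share(50 * GB, 100 * GB, TB)  # explicit 100GB guarantee

# Queue 1 has the smaller share (0.05 vs 0.5) and wins the fair sort
# until its implicit (max) guarantee is consumed -- consistent behavior.
```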
The only edge case where things are slightly strange is when sibling queues
have only partial guarantees applied. IMO this is already a dubious
configuration, though at least the new algorithm handles it consistently given
the rules. The current implementation is practically a random() sort order by
comparison.
> Yunikorn does not appear to be considering Guaranteed resources when
> allocating Pending Pods.
> ---------------------------------------------------------------------------------------------
>
> Key: YUNIKORN-2678
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2678
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Affects Versions: 1.5.1
> Environment: EKS 1.29
> Reporter: Paul Santa Clara
> Assignee: Paul Santa Clara
> Priority: Major
> Attachments: Screenshot 2024-08-06 at 5.18.18 PM.png, Screenshot
> 2024-08-06 at 5.18.21 PM.png, Screenshot 2024-08-06 at 5.18.30 PM.png,
> jira-queues.yaml, jira-tier0-screenshot.png, jira-tier1-screenshot.png,
> jira-tier2-screenshot.png, jira-tier3-screenshot.png
>
>
> Please see the attached queue configuration (jira-queues.yaml).
> I will create 100 pods in each of Tier0, Tier1, Tier2 and Tier3. Each Pod
> will require 1 VCore. Initially, there will be 0 suitable nodes to run the
> Pods and all will be Pending. Karpenter will soon provision Nodes and
> Yunikorn will react by binding the Pods.
> Given this
> [code|https://github.com/apache/yunikorn-core/blob/a786feb5761be28e802d08976d224c40639cd86b/pkg/scheduler/objects/sorters.go#L81C74-L81C95],
> I would expect Yunikorn to distribute the allocations such that each of the
> tiered queues reaches its Guarantees. Instead, I observed a roughly even
> distribution of allocations across all of the queues.
> Tier0 fails to meet its Guarantees while Tier3, for instance, dramatically
> overshoots them.
>
> {code:java}
> > kubectl get pods -n finance | grep tier-0 | grep Pending | wc -l
> 86
> > kubectl get pods -n finance | grep tier-1 | grep Pending | wc -l
> 83
> > kubectl get pods -n finance | grep tier-2 | grep Pending | wc -l
> 78
> > kubectl get pods -n finance | grep tier-3 | grep Pending | wc -l
> 77
> {code}
> Please see the attached screenshots for queue usage.
> Note, this situation can also be reproduced without Karpenter by simply
> setting Yunikorn's `service.schedulingInterval` to a high duration, say 1m.
> Doing so forces Yunikorn to react to 400 Pods across 4 queues at roughly the
> same time, forcing prioritization of queue allocations.
> Test code to generate Pods:
> {code:python}
> from kubernetes import client, config
>
> config.load_kube_config()
> v1 = client.CoreV1Api()
>
> def create_pod_manifest(tier, exec):
>     pod_manifest = {
>         'apiVersion': 'v1',
>         'kind': 'Pod',
>         'metadata': {
>             'name': f"rolling-test-tier-{tier}-exec-{exec}",
>             'namespace': 'finance',
>             'labels': {
>                 'applicationId': f"MyOwnApplicationId-tier-{tier}",
>                 'queue': f"root.tiers.{tier}"
>             },
>             'annotations': {
>                 "yunikorn.apache.org/user.info": '{"user":"system:serviceaccount:finance:spark","groups":["system:serviceaccounts","system:serviceaccounts:finance","system:authenticated"]}'
>             }
>         },
>         'spec': {
>             "affinity": {
>                 "nodeAffinity": {
>                     "requiredDuringSchedulingIgnoredDuringExecution": {
>                         "nodeSelectorTerms": [
>                             {
>                                 "matchExpressions": [
>                                     {
>                                         "key": "di.rbx.com/dedicated",
>                                         "operator": "In",
>                                         "values": ["spark"]
>                                     }
>                                 ]
>                             }
>                         ]
>                     }
>                 }
>             },
>             "tolerations": [
>                 {
>                     "effect": "NoSchedule",
>                     "key": "dedicated",
>                     "operator": "Equal",
>                     "value": "spark"
>                 }
>             ],
>             "schedulerName": "yunikorn",
>             'restartPolicy': 'Always',
>             'containers': [{
>                 "name": "ubuntu",
>                 'image': 'ubuntu',
>                 "command": ["sleep", "604800"],
>                 "imagePullPolicy": "IfNotPresent",
>                 "resources": {
>                     "limits": {'cpu': "1"},
>                     "requests": {'cpu': "1"}
>                 }
>             }]
>         }
>     }
>     return pod_manifest
>
> for i in range(0, 4):
>     tier = str(i)
>     for j in range(0, 100):
>         exec = str(j)
>         pod_manifest = create_pod_manifest(tier, exec)
>         print(pod_manifest)
>         api_response = v1.create_namespaced_pod(body=pod_manifest, namespace="finance")
>         print(f"creating tier( {tier} ) exec( {exec} )")
> {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)