[ https://issues.apache.org/jira/browse/YUNIKORN-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871722#comment-17871722 ]

Craig Condit edited comment on YUNIKORN-2678 at 8/7/24 4:19 PM:
----------------------------------------------------------------

I think we actually do want cluster max in the absence of configured values. 
This gives us the greatest chance of having a non-zero value to use as a 
denominator. Since we're only comparing usage shares of sibling queues, they 
will have the same denominator and it gives us a reasonable basis for 
comparison. It also aligns with how max actually behaves – if a max is not 
configured on a queue, then it is effectively limited to the maximum resources 
available in the cluster. Using that as a denominator for the usage percentage 
makes perfect sense to me.

In fact, I think not doing so would be broken in an out-of-the-box scenario. 
Consider a cluster with 10 CPUs and 100 GB RAM, and two sibling queues 
root.queueA and root.queueB. If no max and no guarantees are configured at all 
(the default config), each queue could consume up to 10 CPUs and 100 GB RAM. If 
we ignore cluster max, both queues have a zero denominator, so both resolve to 
a usage share of zero, devolving into random behavior (or, more likely, always 
favoring whichever queue happens to sort first). This is broken.
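
To make the zero-denominator failure concrete, here is a minimal sketch in Go (the {{queue}} type and {{usageShare}} function are hypothetical stand-ins for illustration, not the real scheduler code):

{code:go}
package main

import "fmt"

// Illustrative sketch only -- the type and field names are hypothetical,
// not YuniKorn's actual implementation.
type queue struct {
	name    string
	usedCPU int64 // millicores currently allocated
	maxCPU  int64 // configured max; 0 means "not configured"
}

// usageShare returns usage divided by the effective max: the configured
// max when present, otherwise the cluster max.
func usageShare(q queue, clusterMaxCPU int64) float64 {
	denom := q.maxCPU
	if denom == 0 {
		denom = clusterMaxCPU // fall back to cluster max
	}
	if denom == 0 {
		return 0 // no denominator at all; share is meaningless
	}
	return float64(q.usedCPU) / float64(denom)
}

func main() {
	clusterMax := int64(10000) // 10 CPUs, as in the example above
	a := queue{name: "root.queueA", usedCPU: 6000}
	b := queue{name: "root.queueB", usedCPU: 2000}

	// With the cluster-max fallback the shares differ (0.6 vs 0.2), so the
	// less-loaded sibling is correctly favored.
	fmt.Printf("%s share=%.1f\n", a.name, usageShare(a, clusterMax))
	fmt.Printf("%s share=%.1f\n", b.name, usageShare(b, clusterMax))

	// Ignoring cluster max leaves both denominators at zero: both shares
	// resolve to 0 and the comparison degenerates.
	fmt.Printf("no fallback: %s share=%.1f, %s share=%.1f\n",
		a.name, usageShare(a, 0), b.name, usageShare(b, 0))
}
{code}

Because siblings always share the same fallback denominator, the comparison stays well defined even under a fully default configuration.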



> Yunikorn does not appear to be considering Guaranteed resources when 
> allocating Pending Pods.
> ---------------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-2678
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2678
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 1.5.1
>         Environment: EKS 1.29
>            Reporter: Paul Santa Clara
>            Assignee: Paul Santa Clara
>            Priority: Major
>         Attachments: Screenshot 2024-08-06 at 5.18.18 PM.png, Screenshot 
> 2024-08-06 at 5.18.21 PM.png, Screenshot 2024-08-06 at 5.18.30 PM.png, 
> jira-queues.yaml, jira-tier0-screenshot.png, jira-tier1-screenshot.png, 
> jira-tier2-screenshot.png, jira-tier3-screenshot.png
>
>
> Please see the attached queue configuration (jira-queues.yaml). 
> I will create 100 pods in Tier0, 100 pods in Tier1, 100 pods in Tier2 and 100 
> pods in Tier3.  Each Pod will require 1 VCore. Initially, there will be 0 
> suitable nodes to run the Pods and all will be Pending. Karpenter will soon 
> provision Nodes and Yunikorn will react by binding the Pods. 
> Given this 
> [code|https://github.com/apache/yunikorn-core/blob/a786feb5761be28e802d08976d224c40639cd86b/pkg/scheduler/objects/sorters.go#L81C74-L81C95],
>  I would expect Yunikorn to distribute the allocations such that each of the 
> tiered queues reaches its Guarantees. Instead, I observed a roughly even 
> distribution of allocations across all of the queues.
> Tier0 fails to meet its Guarantees while Tier3, for instance, dramatically 
> overshoots them.
>  
> {code:java}
> > kubectl get pods -n finance | grep tier-0 | grep Pending | wc -l
>    86
> > kubectl get pods -n finance | grep tier-1 | grep Pending | wc -l
>    83
> > kubectl get pods -n finance | grep tier-2 | grep Pending | wc -l
>    78
> > kubectl get pods -n finance | grep tier-3 | grep Pending | wc -l
>    77
> {code}
> Please see the attached screenshots for queue usage.
> Note, this situation can also be reproduced without Karpenter by simply 
> setting Yunikorn's `service.schedulingInterval` to a high duration, say 1m. 
> Doing so forces Yunikorn to react to 400 Pods across 4 queues at roughly the 
> same time, forcing it to prioritize queue allocations.
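> A minimal sketch of the relevant ConfigMap fragment (assuming the standard 
> yunikorn-configs layout; verify the names against your deployment):
> {code:yaml}
> apiVersion: v1
> kind: ConfigMap
> metadata:
>   name: yunikorn-configs
>   namespace: yunikorn
> data:
>   service.schedulingInterval: "1m"
> {code}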
> Test code to generate Pods:
> {code:java}
> from kubernetes import client, config
> config.load_kube_config()
> v1 = client.CoreV1Api()
> def create_pod_manifest(tier, exec):
>     pod_manifest = {
>         'apiVersion': 'v1',
>         'kind': 'Pod',
>         'metadata': {
>             'name': f"rolling-test-tier-{tier}-exec-{exec}",
>             'namespace': 'finance',
>             'labels': {
>                 'applicationId': f"MyOwnApplicationId-tier-{tier}",
>                 'queue': f"root.tiers.{tier}"
>             },
>             # user.info must be a pod annotation, not a bare metadata field
>             'annotations': {
>                 'yunikorn.apache.org/user.info': '{"user":"system:serviceaccount:finance:spark","groups":["system:serviceaccounts","system:serviceaccounts:finance","system:authenticated"]}'
>             }
>         },
>         'spec': {
>             "affinity": {
>                 "nodeAffinity" : {
>                     "requiredDuringSchedulingIgnoredDuringExecution" : {
>                         "nodeSelectorTerms" : [
>                             {
>                                 "matchExpressions" : [
>                                     {
>                                         "key" : "di.rbx.com/dedicated",
>                                         "operator" : "In",
>                                         "values" : ["spark"]
>                                     }
>                                 ]
>                             }
>                         ]
>                     }
>                 },
>             },
>             "tolerations" : [
>                 {
>                     "effect" : "NoSchedule",
>                     "key": "dedicated",
>                     "operator" : "Equal",
>                     "value" : "spark"
>                 },
>             ],
>             "schedulerName": "yunikorn",
>             'restartPolicy': 'Always',
>             'containers': [{
>                 "name": "ubuntu",
>                 'image': 'ubuntu',
>                 "command": ["sleep", "604800"],
>                 "imagePullPolicy": "IfNotPresent",
>                 "resources" : {
>                     "limits" : {
>                         'cpu' : "1"
>                     },
>                     "requests" : {
>                         'cpu' : "1"
>                     }
>                 }
>             }]
>         }
>     }
>     return pod_manifest
> for i in range(0,4):
>     tier = str(i)
>     for j in range(0,100):
>         exec = str(j)
>         pod_manifest = create_pod_manifest(tier, exec)
>         print(pod_manifest)
>         api_response = v1.create_namespaced_pod(body=pod_manifest, namespace="finance")
>         print(f"creating tier( {tier} ) exec( {exec} )")
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
