[
https://issues.apache.org/jira/browse/YUNIKORN-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870913#comment-17870913
]
Wilfred Spiegelenburg commented on YUNIKORN-2678:
-------------------------------------------------
I have not looked at this code in years.
However, when I look at it now, I think the issue is in the {{nil || zero}}
check when we set the
"[used|https://github.com/apache/yunikorn-core/blob/v1.5.2/pkg/common/resources/resources.go#L488]"
value in the shares. That check does not take into account that we have a large
discrepancy between resource types in absolute values: resources like memory or
storage will always dominate pods or GPUs.
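As a rough illustration of that magnitude problem (a hedged sketch of my reading, not the actual resources.go code): if a raw "used" value ends up in the share instead of a normalised fraction, byte-valued resource types decide the ordering every time.

```python
# Hedged sketch, not yunikorn-core code: compare raw "used" values across
# resource types, as a share calculation would if the value is not
# normalised against a guarantee or total.
used = {"memory": 50 * 1024**3, "pods": 40, "vcore": 10}

# picking the dominant type by raw value: memory (in bytes) always wins
dominant = max(used, key=used.get)
print(dominant)  # memory
```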
Introducing max into the mix with guaranteed will have side effects. Say I
create a queue with max memory set to 1TB and no guarantee, and a second queue
with max set to 1TB but a guaranteed memory of 100GB. Both queues use 50GB. In
that case the share of queue 1 will be 0.05 and the share of queue 2 will be
0.5. Queue 1 will win and get scheduled until it uses 500GB, despite having a
guarantee of 0. Queue 1 should not have a smaller share than queue 2 until all
guaranteed resources are used.
That looks as broken as what we have now.
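The numbers above can be checked with a small sketch, under the assumed rule that share = used / guaranteed when a guarantee is set, else used / max (this rule is my assumption for illustration, not the committed implementation):

```python
# Assumed share rule for illustration: divide by guaranteed if set,
# otherwise fall back to max.
GB = 1
q1 = {"max": 1000 * GB, "guaranteed": 0,        "used": 50 * GB}  # no guarantee
q2 = {"max": 1000 * GB, "guaranteed": 100 * GB, "used": 50 * GB}

def share(q):
    denom = q["guaranteed"] if q["guaranteed"] > 0 else q["max"]
    return q["used"] / denom

print(share(q1))  # 0.05 -> sorts first and keeps winning until 500GB used
print(share(q2))  # 0.5
```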
I could see two options:
# setting a fixed share value if not specified in guaranteed
# not adding anything to the shares unless set in guaranteed
Both options above will fix that same issue. I think option 2 above is the
better solution: we want to schedule based on the guaranteed setting. We need
to test whether allocation is still distributed fairly between the queues when
one queue's usage is over its guarantee compared to a second queue with no
guarantee at all.
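Option 2 could look something like this (a hypothetical helper, not existing core code): only resource types explicitly set in guaranteed contribute to the share.

```python
# Sketch of option 2: ignore resource types that have no guaranteed value.
def share_from_guaranteed(used, guaranteed):
    shares = [used.get(k, 0) / g for k, g in guaranteed.items() if g > 0]
    return max(shares) if shares else 0.0

# queue with guaranteed memory 100GB, using 50GB
print(share_from_guaranteed({"memory": 50}, {"memory": 100}))  # 0.5
# queue with no guarantee contributes nothing to the fairness ordering
print(share_from_guaranteed({"memory": 50}, {}))               # 0.0
```

This is exactly why the fairness test mentioned above matters: a queue with no guarantee always gets share 0 here and would need a tie-breaker.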
If we really want to have a policy for "least used queue" we can build one
based on the maximum resource and the usage.
The other option, which would be nice to have, is a configurable resource
weights option like we have in the node sorting. That would be a new
feature...
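A resource weights option could be sketched along these lines (hypothetical: no such queue-sorting option exists today; node sorting has a comparable weighting concept):

```python
# Hypothetical weighted aggregation of per-type usage fractions.
# Unknown types default to weight 1.0.
def weighted_share(used, total, weights):
    num = sum(weights.get(k, 1.0) * used.get(k, 0) / total[k] for k in total)
    den = sum(weights.get(k, 1.0) for k in total)
    return num / den

# memory half used, pods weighted 4x: pods pressure dominates the share
s = weighted_share({"memory": 500, "pods": 40},
                   {"memory": 1000, "pods": 100},
                   {"memory": 1.0, "pods": 4.0})
print(s)  # 0.42
```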
> Yunikorn does not appear to be considering Guaranteed resources when
> allocating Pending Pods.
> ---------------------------------------------------------------------------------------------
>
> Key: YUNIKORN-2678
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2678
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Affects Versions: 1.5.1
> Environment: EKS 1.29
> Reporter: Paul Santa Clara
> Assignee: Paul Santa Clara
> Priority: Major
> Attachments: jira-queues.yaml, jira-tier0-screenshot.png,
> jira-tier1-screenshot.png, jira-tier2-screenshot.png,
> jira-tier3-screenshot.png
>
>
> Please see the attached queue configuration (jira-queues.yaml).
> I will create 100 pods in Tier0, 100 pods in Tier1, 100 pods in Tier2 and 100
> pods in Tier3. Each Pod will require 1 VCore. Initially, there will be 0
> suitable nodes to run the Pods and all will be Pending. Karpenter will soon
> provision Nodes and Yunikorn will react by binding the Pods.
> Given this
> [code|https://github.com/apache/yunikorn-core/blob/a786feb5761be28e802d08976d224c40639cd86b/pkg/scheduler/objects/sorters.go#L81C74-L81C95],
> I would expect Yunikorn to distribute the allocations such that each of the
> tiered queues reaches its Guarantees. Instead, I observed a roughly even
> distribution of allocations across all of the queues.
> Tier0 fails to meet its Guarantees while Tier3, for instance, dramatically
> overshoots them.
>
> {code:bash}
> > kubectl get pods -n finance | grep tier-0 | grep Pending | wc -l
> 86
> > kubectl get pods -n finance | grep tier-1 | grep Pending | wc -l
> 83
> > kubectl get pods -n finance | grep tier-2 | grep Pending | wc -l
> 78
> > kubectl get pods -n finance | grep tier-3 | grep Pending | wc -l
> 77
> {code}
> Please see the attached screenshots for queue usage.
> Note, this situation can also be reproduced without the use of Karpenter by
> simply setting Yunikorn's `service.schedulingInterval` to a high duration,
> say 1m. Doing so forces Yunikorn to react to 400 Pods -across 4 queues- at
> roughly the same time, forcing prioritization of queue allocations.
> Test code to generate Pods:
> {code:python}
> from kubernetes import client, config
>
> config.load_kube_config()
> v1 = client.CoreV1Api()
>
> def create_pod_manifest(tier, exec):
>     pod_manifest = {
>         'apiVersion': 'v1',
>         'kind': 'Pod',
>         'metadata': {
>             'name': f"rolling-test-tier-{tier}-exec-{exec}",
>             'namespace': 'finance',
>             'labels': {
>                 'applicationId': f"MyOwnApplicationId-tier-{tier}",
>                 'queue': f"root.tiers.{tier}"
>             },
>             'annotations': {
>                 'yunikorn.apache.org/user.info':
>                     '{"user":"system:serviceaccount:finance:spark","groups":["system:serviceaccounts","system:serviceaccounts:finance","system:authenticated"]}'
>             }
>         },
>         'spec': {
>             'affinity': {
>                 'nodeAffinity': {
>                     'requiredDuringSchedulingIgnoredDuringExecution': {
>                         'nodeSelectorTerms': [
>                             {
>                                 'matchExpressions': [
>                                     {
>                                         'key': 'di.rbx.com/dedicated',
>                                         'operator': 'In',
>                                         'values': ['spark']
>                                     }
>                                 ]
>                             }
>                         ]
>                     }
>                 }
>             },
>             'tolerations': [
>                 {
>                     'effect': 'NoSchedule',
>                     'key': 'dedicated',
>                     'operator': 'Equal',
>                     'value': 'spark'
>                 }
>             ],
>             'schedulerName': 'yunikorn',
>             'restartPolicy': 'Always',
>             'containers': [{
>                 'name': 'ubuntu',
>                 'image': 'ubuntu',
>                 'command': ['sleep', '604800'],
>                 'imagePullPolicy': 'IfNotPresent',
>                 'resources': {
>                     'limits': {'cpu': '1'},
>                     'requests': {'cpu': '1'}
>                 }
>             }]
>         }
>     }
>     return pod_manifest
>
> for i in range(0, 4):
>     tier = str(i)
>     for j in range(0, 100):
>         exec = str(j)
>         pod_manifest = create_pod_manifest(tier, exec)
>         print(pod_manifest)
>         api_response = v1.create_namespaced_pod(body=pod_manifest,
>                                                 namespace="finance")
>         print(f"creating tier( {tier} ) exec( {exec} )")
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]