Paul Santa Clara created YUNIKORN-2678:
------------------------------------------
Summary: Yunikorn does not appear to be considering Guaranteed
resources when allocating Pending Pods.
Key: YUNIKORN-2678
URL: https://issues.apache.org/jira/browse/YUNIKORN-2678
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler
Affects Versions: 1.5.1
Environment: EKS 1.29
Reporter: Paul Santa Clara
Attachments: jira-queues.yaml, jira-tier0-screenshot.png,
jira-tier1-screenshot.png, jira-tier2-screenshot.png, jira-tier3-screenshot.png
Please see the attached queue configuration (jira-queues.yaml).
I create 100 Pods in each of Tier0, Tier1, Tier2 and Tier3, 400 in total.
Each Pod requests 1 vcore. Initially there are no suitable nodes to run the
Pods, so all of them are Pending. Karpenter then provisions nodes and
YuniKorn reacts by binding the Pods.
Given this
[code|https://github.com/apache/yunikorn-core/blob/a786feb5761be28e802d08976d224c40639cd86b/pkg/scheduler/objects/sorters.go#L81C74-L81C95],
I would expect YuniKorn to distribute the allocations such that each of the
tiered queues reaches its guarantees. Instead, I observed a roughly even
distribution of allocations across all of the queues.
Tier0 fails to meet its guarantees while Tier3, for instance, dramatically
overshoots its own.
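To make the expectation concrete, here is a minimal sketch of a guarantee-aware distribution. It is not YuniKorn's implementation: the function name `distribute_by_guarantee` and the guaranteed values are hypothetical (the real guarantees are in jira-queues.yaml), and it simply hands out one vcore at a time to the queue with the lowest used/guaranteed ratio. The capacity of 76 is the number of Pods that were bound when the counts in this report were taken (400 created minus 324 Pending).

```python
from fractions import Fraction


def distribute_by_guarantee(guaranteed, demand, capacity):
    """Hand out `capacity` vcores one at a time, always to the queue with
    the lowest used/guaranteed ratio. Guarantees must be greater than zero."""
    used = {queue: 0 for queue in guaranteed}
    for _ in range(capacity):
        # Only queues that still have pending demand can receive a vcore.
        candidates = [q for q in guaranteed if used[q] < demand[q]]
        if not candidates:
            break
        # Lowest share of its guarantee goes first; ties break on queue name.
        best = min(candidates, key=lambda q: (Fraction(used[q], guaranteed[q]), q))
        used[best] += 1
    return used


# HYPOTHETICAL guarantees, chosen only for illustration.
guaranteed = {"tier0": 40, "tier1": 20, "tier2": 10, "tier3": 6}
demand = {queue: 100 for queue in guaranteed}  # 100 Pods of 1 vcore per queue
print(distribute_by_guarantee(guaranteed, demand, capacity=76))
```

With these hypothetical guarantees, which sum to the 76 available vcores, every queue ends at exactly its guarantee rather than at a roughly even split.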
{code:bash}
> kubectl get pods -n finance | grep tier-0 | grep Pending | wc -l
86
> kubectl get pods -n finance | grep tier-1 | grep Pending | wc -l
83
> kubectl get pods -n finance | grep tier-2 | grep Pending | wc -l
78
> kubectl get pods -n finance | grep tier-3 | grep Pending | wc -l
77
{code}
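The same counts can be gathered programmatically. The sketch below is a rough equivalent of the kubectl pipeline above, under stated assumptions: `count_pending_by_tier` is a hypothetical helper that takes (name, phase) pairs and matches tiers by substring, as grep does. It relies on the Pod phase, which may differ slightly from the STATUS column that kubectl prints.

```python
def count_pending_by_tier(pods, tiers=("tier-0", "tier-1", "tier-2", "tier-3")):
    """Count Pending Pods per tier from an iterable of (pod_name, phase) pairs."""
    counts = {tier: 0 for tier in tiers}
    for name, phase in pods:
        if phase != "Pending":
            continue
        for tier in tiers:
            # Substring match, mirroring `grep tier-N`.
            if tier in name:
                counts[tier] += 1
                break
    return counts


# Made-up sample data so the helper can be tried without a cluster.
sample = [
    ("rolling-test-tier-0-exec-0", "Pending"),
    ("rolling-test-tier-0-exec-1", "Running"),
    ("rolling-test-tier-3-exec-0", "Pending"),
]
print(count_pending_by_tier(sample))

# Against a live cluster (assumes the `kubernetes` Python client and a kubeconfig):
# from kubernetes import client, config
# config.load_kube_config()
# pods = client.CoreV1Api().list_namespaced_pod("finance").items
# print(count_pending_by_tier((p.metadata.name, p.status.phase) for p in pods))
```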
Please see the attached screenshots for queue usage.
Note that this situation can also be reproduced without Karpenter by
simply setting YuniKorn's `service.schedulingInterval` to a long duration, say
1m. Doing so forces YuniKorn to react to all 400 Pods, across 4 queues, at
roughly the same time, which forces it to prioritize allocations between queues.
Test code to generate Pods:
{code:python}
from kubernetes import client, config

# Load credentials from the local kubeconfig and create a core API client.
config.load_kube_config()
v1 = client.CoreV1Api()


def create_pod_manifest(tier, exec_id):
    """Build a 1-vcore Pod manifest that targets queue root.tiers.<tier>."""
    pod_manifest = {
        'apiVersion': 'v1',
        'kind': 'Pod',
        'metadata': {
            'name': f"rolling-test-tier-{tier}-exec-{exec_id}",
            'namespace': 'finance',
            'labels': {
                'applicationId': f"MyOwnApplicationId-tier-{tier}",
                'queue': f"root.tiers.{tier}"
            },
            # YuniKorn reads the user info from a Pod annotation.
            'annotations': {
                "yunikorn.apache.org/user.info":
                    '{"user":"system:serviceaccount:finance:spark","groups":["system:serviceaccounts","system:serviceaccounts:finance","system:authenticated"]}'
            }
        },
        'spec': {
            # Restrict the Pods to the dedicated Spark nodes.
            "affinity": {
                "nodeAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [
                            {
                                "matchExpressions": [
                                    {
                                        "key": "di.rbx.com/dedicated",
                                        "operator": "In",
                                        "values": ["spark"]
                                    }
                                ]
                            }
                        ]
                    }
                },
            },
            "tolerations": [
                {
                    "effect": "NoSchedule",
                    "key": "dedicated",
                    "operator": "Equal",
                    "value": "spark"
                },
            ],
            "schedulerName": "yunikorn",
            'restartPolicy': 'Always',
            'containers': [{
                "name": "ubuntu",
                'image': 'ubuntu',
                "command": ["sleep", "604800"],
                "imagePullPolicy": "IfNotPresent",
                # Each Pod requests and is limited to exactly 1 vcore.
                "resources": {
                    "limits": {
                        'cpu': "1"
                    },
                    "requests": {
                        'cpu': "1"
                    }
                }
            }]
        }
    }
    return pod_manifest


# Create 100 Pods in each of the four tiers (400 Pods in total).
for i in range(0, 4):
    tier = str(i)
    for j in range(0, 100):
        exec_id = str(j)
        pod_manifest = create_pod_manifest(tier, exec_id)
        print(pod_manifest)
        api_response = v1.create_namespaced_pod(body=pod_manifest,
                                                namespace="finance")
        print(f"created tier( {tier} ) exec( {exec_id} )")
{code}