[
https://issues.apache.org/jira/browse/YUNIKORN-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871475#comment-17871475
]
Craig Condit commented on YUNIKORN-2678:
----------------------------------------
I also hadn't looked at this code in a very long time, but I have been working
closely with Paul to help develop the new solution. After combing through it,
I'm convinced that the entire shares[] comparison logic is utterly broken by
design. There's simply no way to get consistent results out of it: the shares
are not returned in a consistent order, and the first value found (which could
be pods, memory, CPU, ephemeral storage, or really anything else) becomes the
sort key. In Paul's example it was ephemeral storage, so the queue with less
usage along that dimension always sorted first.
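The problem can be illustrated with a small sketch (a hypothetical Python simplification of the comparison, not the actual Go code in sorters.go): when only the first share value encountered decides the ordering, the iteration order of the resource types flips the result.

```python
# Hypothetical simplification of the current comparison: the first
# differing share value decides the whole sort, so the (unstable)
# resource iteration order determines the outcome.
def first_share_compare(shares_a, shares_b, resource_order):
    for res in resource_order:
        a, b = shares_a.get(res, 0.0), shares_b.get(res, 0.0)
        if a != b:
            return a - b  # first differing value decides everything
    return 0.0

q1 = {"memory": 0.9, "ephemeral-storage": 0.1}
q2 = {"memory": 0.2, "ephemeral-storage": 0.8}

# Same data, opposite orderings depending on which resource comes first.
by_memory = first_share_compare(q1, q2, ["memory", "ephemeral-storage"])
by_storage = first_share_compare(q1, q2, ["ephemeral-storage", "memory"])
```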
The logic currently also falls back to returning an absolute resource value in
the case where no guarantee is set. This means one queue returns a fractional
value (typically 0 to 1, though it can be larger if the queue is overcommitted),
while a lack of guarantee for that resource in another queue returns the total
usage instead (which is almost certainly orders of magnitude larger than a
fraction). Even 2 bytes of RAM or 2 millicores will sort as larger. It's simply
nonsensical to compare fractional usage values with absolute units.
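To make the unit mismatch concrete, here is a sketch of the fallback behavior (illustrative numbers and names, not the real implementation):

```python
# Sketch of the fallback described above: with a guarantee, the share
# is usage/guarantee (a fraction); without one, the comparison falls
# back to the raw usage in absolute units.
def share(usage, guarantee=None):
    if guarantee:
        return usage / guarantee  # fractional, typically 0..1
    return usage  # absolute units: bytes, millicores, ...

GB = 10**9
heavy = share(900 * GB, guarantee=1000 * GB)  # 900GB of a 1TB guarantee
light = share(2)                              # 2 bytes, no guarantee

# Two bytes of RAM sort as "more loaded" than 900GB of a guaranteed TB.
```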
{quote}Introducing max in the mix with guarantee will have side effects. I
create a queue with max memory set to 1TB, no guaranteed. I create a second
queue with max set to 1TB but a guaranteed memory of 100GB. Both queues use
50GB. In that case share of queue 1 will be 0.05, queue 2 will have a share of
0.5 Queue 1 will win and get scheduled until it uses 500GB, with a guaranteed
of 0. Queue 1 should not have a smaller share than queue 2 until all guaranteed
is used.
That looks as broken as what we have now.
{quote}
I disagree. Treating lack of (specific) guarantee as == max aligns with how
this is already treated for the purposes of preemption. In that case, if no
guarantee is supplied, the guarantee is treated as unlimited (or equal to max
if you prefer). The result is that lack of guarantee doesn't allow a queue to
either trigger preemption or *be preempted itself* (which is the part that
works like guaranteed implicitly equalling max).
The proposed algorithm for fair sorting works similarly, and if you take your
example using those semantics, it actually makes sense. Queue 2 has a guarantee
of 100GB, while Queue 1 has an *implicit* guarantee of 1TB. Both have *implicit*
guarantees of _any other resource type not mentioned_ as well. This ensures
that even if guaranteed resources are not supplied (which is the default
configuration), fair scheduling still works consistently.
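Applying those semantics to the quoted example can be sketched as follows (an assumption based on this discussion, not the committed implementation):

```python
# Sketch of the proposed semantics: a missing guarantee is treated as
# equal to the queue's max before computing the fair share.
def fair_share(usage, guaranteed, maximum):
    effective = guaranteed if guaranteed is not None else maximum
    return usage / effective

GB = 10**9
TB = 1000 * GB

# Quoted example: both queues use 50GB and have a max memory of 1TB;
# only the second queue guarantees 100GB.
queue1 = fair_share(50 * GB, None, TB)      # implicit guarantee = max
queue2 = fair_share(50 * GB, 100 * GB, TB)  # explicit 100GB guarantee

# Queue 1 has the smaller share (0.05 vs 0.5) and wins the fair sort
# until its implicit (max) guarantee is consumed -- consistent behavior.
```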
The only edge case where things are slightly strange is when sibling queues
have only partial guarantees applied. IMO this is already a dubious
configuration, though at least the new algorithm handles it consistently given
the rules. The current implementation is practically a random() sort order by
comparison.
> Yunikorn does not appear to be considering Guaranteed resources when
> allocating Pending Pods.
> ---------------------------------------------------------------------------------------------
>
> Key: YUNIKORN-2678
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2678
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Affects Versions: 1.5.1
> Environment: EKS 1.29
> Reporter: Paul Santa Clara
> Assignee: Paul Santa Clara
> Priority: Major
> Attachments: Screenshot 2024-08-06 at 5.18.18 PM.png, Screenshot
> 2024-08-06 at 5.18.21 PM.png, Screenshot 2024-08-06 at 5.18.30 PM.png,
> jira-queues.yaml, jira-tier0-screenshot.png, jira-tier1-screenshot.png,
> jira-tier2-screenshot.png, jira-tier3-screenshot.png
>
>
> Please see the attached queue configuration (jira-queues.yaml).
> I will create 100 pods in each of Tier0, Tier1, Tier2 and Tier3. Each Pod
> will require 1 VCore. Initially, there will be 0 suitable nodes to run the
> Pods and all will be Pending. Karpenter will soon provision Nodes and
> Yunikorn will react by binding the Pods.
> Given this
> [code|https://github.com/apache/yunikorn-core/blob/a786feb5761be28e802d08976d224c40639cd86b/pkg/scheduler/objects/sorters.go#L81C74-L81C95],
> I would expect Yunikorn to distribute the allocations such that each of the
> tiered queues reaches its Guarantees. Instead, I observed a roughly even
> distribution of allocations across all of the queues.
> Tier0 fails to meet its Guarantees while Tier3, for instance, dramatically
> overshoots them.
>
> {code:java}
> > kubectl get pods -n finance | grep tier-0 | grep Pending | wc -l
> 86
> > kubectl get pods -n finance | grep tier-1 | grep Pending | wc -l
> 83
> > kubectl get pods -n finance | grep tier-2 | grep Pending | wc -l
> 78
> > kubectl get pods -n finance | grep tier-3 | grep Pending | wc -l
> 77
> {code}
> Please see the attached screenshots for queue usage.
> Note, this situation can also be reproduced without Karpenter by simply
> setting Yunikorn's `service.schedulingInterval` to a high duration, say 1m.
> Doing so forces Yunikorn to react to 400 Pods across 4 queues at roughly the
> same time, forcing prioritization of queue allocations.
> Test code to generate Pods:
> {code:python}
> from kubernetes import client, config
>
> config.load_kube_config()
> v1 = client.CoreV1Api()
>
> def create_pod_manifest(tier, exec):
>     pod_manifest = {
>         'apiVersion': 'v1',
>         'kind': 'Pod',
>         'metadata': {
>             'name': f"rolling-test-tier-{tier}-exec-{exec}",
>             'namespace': 'finance',
>             'labels': {
>                 'applicationId': f"MyOwnApplicationId-tier-{tier}",
>                 'queue': f"root.tiers.{tier}"
>             },
>             'annotations': {
>                 "yunikorn.apache.org/user.info": '{"user":"system:serviceaccount:finance:spark","groups":["system:serviceaccounts","system:serviceaccounts:finance","system:authenticated"]}'
>             }
>         },
>         'spec': {
>             "affinity": {
>                 "nodeAffinity": {
>                     "requiredDuringSchedulingIgnoredDuringExecution": {
>                         "nodeSelectorTerms": [
>                             {
>                                 "matchExpressions": [
>                                     {
>                                         "key": "di.rbx.com/dedicated",
>                                         "operator": "In",
>                                         "values": ["spark"]
>                                     }
>                                 ]
>                             }
>                         ]
>                     }
>                 }
>             },
>             "tolerations": [
>                 {
>                     "effect": "NoSchedule",
>                     "key": "dedicated",
>                     "operator": "Equal",
>                     "value": "spark"
>                 }
>             ],
>             "schedulerName": "yunikorn",
>             'restartPolicy': 'Always',
>             'containers': [{
>                 "name": "ubuntu",
>                 'image': 'ubuntu',
>                 "command": ["sleep", "604800"],
>                 "imagePullPolicy": "IfNotPresent",
>                 "resources": {
>                     "limits": {'cpu': "1"},
>                     "requests": {'cpu': "1"}
>                 }
>             }]
>         }
>     }
>     return pod_manifest
>
> for i in range(0, 4):
>     tier = str(i)
>     for j in range(0, 100):
>         exec = str(j)
>         pod_manifest = create_pod_manifest(tier, exec)
>         print(pod_manifest)
>         api_response = v1.create_namespaced_pod(body=pod_manifest, namespace="finance")
>         print(f"creating tier( {tier} ) exec( {exec} )")
> {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)