[ https://issues.apache.org/jira/browse/YUNIKORN-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17865524#comment-17865524 ]

Craig Condit commented on YUNIKORN-2735:
----------------------------------------

{quote}We have/had a TODO in the code to make this configurable. Currently it 
is fixed to 2 seconds. It should be a reloadable configuration value. I would 
also argue that the current 2 seconds is too quick and 30 seconds would allow 
us to be a bit more eager.

I would propose the following setup:
 * configuration name: service.ReservationDelay
 * granularity: seconds
 * default: 30 seconds
 * minimum: 2 seconds (allow current behaviour)
 * maximum: 3600 seconds (prevent starvation and turning off reservations)
 * reloadable: true
 * notes:
 ** old reservations are not re-evaluated when the value is changed
 ** settings outside the minimum..maximum range will use the default
 ** when reloading the value is not changed if outside the range{quote}
A few observations on this. There are a few different potential tuneables here, I 
think. The first is the existing 2-second delay you're referencing – that 
controls how long we allow an outstanding request to remain unscheduled before we 
attempt a reservation (and I agree that 2 seconds is too low as a default). 
The other is, I think, what Elad and Volodymyr might be more interested in: 
a ceiling on how long a node may be reserved before we give up and 
schedule something else anyway. Currently that is unlimited, which I can see 
being an issue. We may also want some sort of limit on how many nodes (either an 
absolute count or a percentage of the cluster) are allowed to be reserved at any 
given time.

Consider this scenario:

5-node cluster, fair scheduling policy (so nodes get loaded relatively evenly). 
Let's ignore specific resources and just consider percentage of allocation. 
Say we have existing running workloads consuming ~50% of each node. Now 
assume 5 requests arrive for pods which would each consume 75% of a node. 
None are schedulable, as there aren't sufficient resources 
available. After 2 seconds we start reserving nodes, and now all 
5 nodes in the cluster are reserved for very large pods. Now suppose a bunch of 
small allocations arrive waiting to be scheduled. Because all 5 nodes are 
reserved, until (or unless) capacity frees up on at least one of them, not only 
will the large allocations not be scheduled, the smaller ones won't be either. 
This is technically working as designed; we're attempting to ensure that the 
large pods eventually get scheduled, but if usage never drops to the 25% needed 
to fit a large pod on any given node, we could be waiting a long time (or 
forever). Meanwhile, to an outside observer, the scheduler is "stuck". 
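The scenario above can be traced with a small sketch (illustrative only, not YuniKorn code – percentages stand in for real resource vectors, and the `fits` helper is hypothetical):

```go
package main

import "fmt"

// fits reports whether a request needing `req` fraction of a node fits on a
// node that already has `used` fraction allocated.
func fits(used, req float64) bool {
	return 1.0-used >= req
}

func main() {
	// 5 nodes at ~50% utilization, 5 pending requests each needing 75% of a node.
	used := []float64{0.5, 0.5, 0.5, 0.5, 0.5}
	reserved := make([]bool, len(used))

	// None of the large requests fit, so after the 2-second delay each one
	// reserves a node; with 5 requests, every node ends up reserved.
	for i := range used {
		if !fits(used[i], 0.75) {
			reserved[i] = true
		}
	}

	// A small request (10% of a node) would fit anywhere, but reserved nodes
	// are off-limits, so it is blocked too.
	placed := false
	for i := range used {
		if !reserved[i] && fits(used[i], 0.10) {
			placed = true
		}
	}
	fmt.Println("small pod placed:", placed) // prints "small pod placed: false"
}
```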

So how do we solve for this? One option would be to limit how many nodes can be 
reserved at once. If, in the same scenario, we had a tunable that limited 
reservations to no more than 50% of the nodes, we'd reserve nodes for only 2 of 
the 5 large allocations (still allowing the large allocations to 
make forward progress, but ensuring that reservations can't hold the whole 
cluster hostage). Additionally, we could have a time limit on reservations, so 
that a reservation can't hold a node for longer than a configurable duration. 
A reasonable default would be hard to gauge without knowing how particular 
workloads behave, but let's say 5 minutes. A combination of the two options would 
probably be beneficial in a wide variety of scenarios.

As for the specifics of configuring your proposed ReservationDelay property – 
I'm not a fan of setting min/max bounds on things like this. There could be 
legitimate reasons to use delays > 1h. For simplicity, just say it's defined in 
seconds and must be a positive integer > 0. I would also like to see reload 
behavior match startup behavior for values outside acceptable ranges. We 
already have too many places where reload behavior doesn't match startup; we 
should use identical logic in both cases.

To summarize, how about this:
 * service.reservationDelay: "30s" (must be > 0)
 * service.reservationTimeout: "15m" (must be >= 0 – zero allowed to 
support current behavior)
 * service.reservationNodePercentage: "0.5" (float between 0 and 1 
inclusive)

 

> YuniKorn doesn't schedule correctly after some pods were marked as 
> Unschedulable
> --------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-2735
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2735
>             Project: Apache YuniKorn
>          Issue Type: Bug
>            Reporter: Volodymyr Kot
>            Priority: Major
>         Attachments: bug-logs, driver.yml, executor.yml, nodestate, podstate
>
>
> It is a bit of an edge case, but I can consistently reproduce this on master 
> - see steps and comments used below:
>  # Create a new cluster with kind, with 4 cpus/8Gb of memory
>  # Deploy YuniKorn using helm
>  # Set up service account for Spark
>  ## "kubectl create serviceaccount spark"
>  ## "kubectl create clusterrolebinding spark-role --clusterrole=edit 
> --serviceaccount=default:spark --namespace=default"
>  # Run "kubectl proxy" to be able to run spark-submit
>  # Create Spark application 1* with driver and 2 executors - fits fully, 
> placeholders are created and replaced
>  # Create Spark application 2 with driver and 2 executors - only one executor 
> placeholder is scheduled, rest of the pods are marked Unschedulable
>  # Delete one of the executors from application 1
>  # Spark driver re-creates the executor, it is marked as unschedulable
>  
> At that point scheduler is "stuck", and won't schedule either executor from 
> application 1 OR placeholder for executor from application 2 - it deems both 
> of those unschedulable. See logs below, and please let me know if I 
> misunderstood something/it is expected behavior!
>  
> *Script used to run spark-submit:
> {code:java}
> ${SPARK_HOME}/bin/spark-submit --master k8s://http://localhost:8001 \
>    --deploy-mode cluster --name spark-pi \
>    --class org.apache.spark.examples.SparkPi \
>    --conf spark.executor.instances=2 \
>    --conf spark.kubernetes.executor.request.cores=0.5 \
>    --conf spark.kubernetes.container.image=docker.io/apache/spark:v3.4.0 \
>    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
>    --conf spark.kubernetes.driver.podTemplateFile=./driver.yml \
>    --conf spark.kubernetes.executor.podTemplateFile=./executor.yml \
>    local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar 30000 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
