[ https://issues.apache.org/jira/browse/YUNIKORN-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17865266#comment-17865266 ]

Wilfred Spiegelenburg commented on YUNIKORN-2735:
-------------------------------------------------

{quote}At that point scheduler is "stuck", and won't schedule either executor 
from application 1 OR placeholder for executor from application 2 - it deems 
both of those unschedulable. See logs below, and please let me know if I 
misunderstood something/it is expected behavior!
{quote}
It is expected behaviour. The scheduler is not stuck. This will resolve itself.

First, Spark application 2: since this is gang scheduling, the placeholders will 
time out (15 minutes by default). If not all of the placeholders were allocated 
at the point of the timeout, a cleanup is triggered that removes all placeholder 
pods for that application. Depending on the gang scheduling style, hard or soft, 
we either fail the application or release the driver pod for scheduling. At that 
point you are unblocked.
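For context, the timeout and the gang style are set per application through annotations on the originating (driver) pod. A sketch with illustrative values (the task group name, member count, sizes and timeout below are assumptions, not taken from the attached templates):

{code:yaml}
metadata:
  annotations:
    # this pod originates the gang; "spark-driver" is an illustrative group name
    yunikorn.apache.org/task-group-name: "spark-driver"
    # declare the gang: 2 executor placeholders of the given size
    yunikorn.apache.org/task-groups: |-
      [{
        "name": "spark-executor",
        "minMember": 2,
        "minResource": {"cpu": "500m", "memory": "1Gi"}
      }]
    # placeholder timeout (default 900s = 15 min) and the Soft vs Hard style
    yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=900 gangSchedulingStyle=Soft"
{code}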

Application 1 pods will get scheduled based on the availability of resources. 
When the placeholder pod(s) time out, the existing pending pods will be 
scheduled. At that point the normal sorting rules apply. This _could_ mean that 
the re-submitted executor pod gets scheduled, or some other pod that was waiting.

Gang scheduling allows you to reserve resources, but it does not guarantee them 
after replacement. If you kill the executor pod and it gets restarted, it is 
just another pod on the cluster that needs to be scheduled. How and when that 
scheduling happens thus depends on your configuration (FIFO, priority, pod 
definition, etc.). From the K8s point of view the newly started executor really 
is a new pod, with a different submit time etc. If you have FIFO configured it 
will end up at the back of the scheduling queue.
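For completeness, the application sort policy is part of the queue configuration. A minimal sketch of a FIFO-sorted queue (the queue names are illustrative):

{code:yaml}
partitions:
  - name: default
    queues:
      - name: root
        queues:
          - name: sandbox
            properties:
              # applications are considered in submission order
              application.sort.policy: fifo
{code}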

Gang scheduling with the soft style also will not prevent starving a cluster of 
resources. You could have the case that the total gang request is too large to 
fit into the free space on a busy cluster. That first triggers reservations 
which block resources for other applications. Then, after the timeout, you could 
slowly fill your cluster with driver pods that do not get what they want and 
thus progress only slowly, or not at all. The only option you have for that is 
to limit the number of applications you allow to run in a queue 
(MaxApplications). This case can easily happen in a cluster of any size.
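One way to express that cap is a limit entry on the queue. A sketch with an illustrative queue name and value:

{code:yaml}
partitions:
  - name: default
    queues:
      - name: root
        queues:
          - name: sandbox
            limits:
              # cap the number of concurrently running applications
              # for all users submitting to this queue
              - limit: "max apps in sandbox"
                users:
                  - "*"
                maxapplications: 4
{code}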

None of these are real scheduler issues, they are cluster management issues. 
You cannot expect the scheduler to understand the workload you put on a cluster 
and magically adjust.

> YuniKorn doesn't schedule correctly after some pods were marked as 
> Unschedulable
> --------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-2735
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2735
>             Project: Apache YuniKorn
>          Issue Type: Bug
>            Reporter: Volodymyr Kot
>            Priority: Major
>         Attachments: bug-logs, driver.yml, executor.yml, nodestate, podstate
>
>
> It is a bit of an edge case, but I can consistently reproduce this on master 
> - see steps and comments used below:
>  # Create a new cluster with kind, with 4 cpus/8Gb of memory
>  # Deploy YuniKorn using helm
>  # Set up service account for Spark
>  ## "kubectl create serviceaccount spark"
>  ## "kubectl create clusterrolebinding spark-role --clusterrole=edit 
> --serviceaccount=default:spark --namespace=default"
>  # Run "kubectl proxy" to be able to run spark-submit
>  # Create Spark application* 1 with driver and 2 executors - fits fully, 
> placeholders are created and replaced
>  # Create Spark application 2 with driver and 2 executors - only one executor 
> placeholder is scheduled, rest of the pods are marked Unschedulable
>  # Delete one of the executors from application 1
>  # Spark driver re-creates the executor, it is marked as unschedulable
>  
> At that point scheduler is "stuck", and won't schedule either executor from 
> application 1 OR placeholder for executor from application 2 - it deems both 
> of those unschedulable. See logs below, and please let me know if I 
> misunderstood something/it is expected behavior!
>  
> *Script used to run spark-submit:
> {code:bash}
> ${SPARK_HOME}/bin/spark-submit \
>    --master k8s://http://localhost:8001 --deploy-mode cluster --name spark-pi \
>    --class org.apache.spark.examples.SparkPi \
>    --conf spark.executor.instances=2 \
>    --conf spark.kubernetes.executor.request.cores=0.5 \
>    --conf spark.kubernetes.container.image=docker.io/apache/spark:v3.4.0 \
>    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
>    --conf spark.kubernetes.driver.podTemplateFile=./driver.yml \
>    --conf spark.kubernetes.executor.podTemplateFile=./executor.yml \
>    local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar 30000 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
