[
https://issues.apache.org/jira/browse/YUNIKORN-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17865266#comment-17865266
]
Wilfred Spiegelenburg commented on YUNIKORN-2735:
-------------------------------------------------
{quote}At that point scheduler is "stuck", and won't schedule either executor
from application 1 OR placeholder for executor from application 2 - it deems
both of those unschedulable. See logs below, and please let me know if I
misunderstood something/it is expected behavior!
{quote}
It is expected behaviour. The scheduler is not stuck; this will resolve itself.
First, Spark application 2: as this is gang scheduling, the placeholders will
time out (15 minutes by default). If not all of the placeholders were allocated
at the point of the timeout, a cleanup is triggered. This removes all of the
application's placeholder pods from the system. Depending on the gang style,
hard or soft, we either fail the application or release the driver pod for
scheduling. At that point you are unblocked.
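For reference, the timeout and gang style are controlled through annotations on the driver pod (annotation names as documented by YuniKorn; the group names, sizes and timeout value here are illustrative only):
{code:yaml}
apiVersion: v1
kind: Pod
metadata:
  name: spark-pi-driver
  annotations:
    # task group this pod itself belongs to
    yunikorn.apache.org/task-group-name: "spark-driver"
    # placeholder timeout in seconds, and gang style: Soft (default) or Hard
    yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=900 gangSchedulingStyle=Soft"
    # task groups that placeholders are created for
    yunikorn.apache.org/task-groups: |-
      [{
        "name": "spark-executor",
        "minMember": 2,
        "minResource": {"cpu": "500m", "memory": "512Mi"}
      }]
spec:
  schedulerName: yunikorn
  containers:
    - name: driver
      image: docker.io/apache/spark:v3.4.0
{code}
With the hard style the whole application is failed on timeout; with the soft style shown here the placeholders are released and the real pods fall back to normal scheduling.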
Application 1 pods will get scheduled based on the availability of resources.
When the placeholder pod(s) time out, the existing pending pods will be
scheduled. At that point the normal sorting rules apply: this _could_ mean that
the re-submitted executor pod gets scheduled, or some other pod that was waiting.
Gang scheduling allows you to reserve resources, but it does not guarantee them
after replacement. If you kill the executor pod and it gets restarted, it is
just another pod on the cluster that needs to be scheduled. How and when that
scheduling happens will thus depend on your config (FIFO, priority, pod
definition etc.). From the K8s point of view the newly started executor really
is a new pod, with a different submit time etc. If you have FIFO configured it
will end up at the back of the scheduling queue.
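For completeness, the application sorting inside a queue is part of the queue configuration; a FIFO setup would look something like this (partition and queue names illustrative):
{code:yaml}
partitions:
  - name: default
    queues:
      - name: root
        queues:
          - name: default
            properties:
              # applications in this queue are ordered by submission time
              application.sort.policy: fifo
{code}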
Gang scheduling with the soft style also does not prevent starving a cluster of
resources. You could have the case that the total gang request is too large to
fit into the free space on a busy cluster. First the reservations are triggered,
blocking resources for other applications. Then, after the timeout, you could
slowly fill your cluster with driver pods that do not get what they want and
thus progress only slowly or not at all. The only option you have for that is
to limit the number of applications you allow to run in a queue
(MaxApplications). This case can easily happen in a cluster of any size.
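Limiting the number of running applications is likewise a queue-level setting; something like this (the value 10 is illustrative):
{code:yaml}
partitions:
  - name: default
    queues:
      - name: root
        queues:
          - name: default
            # cap on concurrently running applications in this queue
            maxapplications: 10
{code}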
None of these are real scheduler issues; they are cluster management issues.
You cannot expect the scheduler to understand the workload you put on a cluster
and magically adjust.
> YuniKorn doesn't schedule correctly after some pods were marked as
> Unschedulable
> --------------------------------------------------------------------------------
>
> Key: YUNIKORN-2735
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2735
> Project: Apache YuniKorn
> Issue Type: Bug
> Reporter: Volodymyr Kot
> Priority: Major
> Attachments: bug-logs, driver.yml, executor.yml, nodestate, podstate
>
>
> It is a bit of an edge case, but I can consistently reproduce this on master
> - see steps and comments used below:
> # Create a new cluster with kind, with 4 cpus/8Gb of memory
> # Deploy YuniKorn using helm
> # Set up service account for Spark
> ## "kubectl create serviceaccount spark"
> ## "kubectl create clusterrolebinding spark-role --clusterrole=edit
> --serviceaccount=default:spark --namespace=default"
> # Run "kubectl proxy" to be able to run spark-submit
> # Create Spark application* 1 with driver and 2 executors - fits fully,
> placeholders are created and replaced
> # Create Spark application 2 with driver and 2 executors - only one executor
> placeholder is scheduled, rest of the pods are marked Unschedulable
> # Delete one of the executors from application 1
> # Spark driver re-creates the executor, it is marked as unschedulable
>
> At that point scheduler is "stuck", and won't schedule either executor from
> application 1 OR placeholder for executor from application 2 - it deems both
> of those unschedulable. See logs below, and please let me know if I
> misunderstood something/it is expected behavior!
>
> *Script used to run spark-submit:
> {code:java}
> ${SPARK_HOME}/bin/spark-submit \
>   --master k8s://http://localhost:8001 --deploy-mode cluster --name spark-pi \
> --class org.apache.spark.examples.SparkPi \
> --conf spark.executor.instances=2 \
> --conf spark.kubernetes.executor.request.cores=0.5 \
> --conf spark.kubernetes.container.image=docker.io/apache/spark:v3.4.0 \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
> --conf spark.kubernetes.driver.podTemplateFile=./driver.yml \
> --conf spark.kubernetes.executor.podTemplateFile=./executor.yml \
> local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar 30000 {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)