[
https://issues.apache.org/jira/browse/YUNIKORN-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17865211#comment-17865211
]
Craig Condit commented on YUNIKORN-2735:
----------------------------------------
We definitely do not want to disable reservation (that env var should have been
removed long ago). However, there may be room for some configuration around how
long an allocation may be reserved before we attempt to schedule something
elsewhere. [~wilfreds], any thoughts?
> YuniKorn doesn't schedule correctly after some pods were marked as
> Unschedulable
> --------------------------------------------------------------------------------
>
> Key: YUNIKORN-2735
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2735
> Project: Apache YuniKorn
> Issue Type: Bug
> Reporter: Volodymyr Kot
> Priority: Major
> Attachments: bug-logs, driver.yml, executor.yml, nodestate, podstate
>
>
> It is a bit of an edge case, but I can consistently reproduce this on master
> - see steps and comments used below:
> # Create a new cluster with kind, with 4 CPUs/8 GB of memory
> # Deploy YuniKorn using helm
> # Set up service account for Spark
> ## "kubectl create serviceaccount spark"
> ## "kubectl create clusterrolebinding spark-role --clusterrole=edit
> --serviceaccount=default:spark --namespace=default"
> # Run "kubectl proxy" to be able to run spark-submit
> # Create Spark application* 1 with driver and 2 executors - fits fully,
> placeholders are created and replaced
> # Create Spark application 2 with driver and 2 executors - only one executor
> placeholder is scheduled, rest of the pods are marked Unschedulable
> # Delete one of the executors from application 1
> # Spark driver re-creates the executor, it is marked as unschedulable
>
> At that point scheduler is "stuck", and won't schedule either executor from
> application 1 OR placeholder for executor from application 2 - it deems both
> of those unschedulable. See logs below, and please let me know if I
> misunderstood something/it is expected behavior!
>
> *Script used to run spark-submit:
> {code:java}
> ${SPARK_HOME}/bin/spark-submit \
> --master k8s://http://localhost:8001 --deploy-mode cluster --name spark-pi \
> --class org.apache.spark.examples.SparkPi \
> --conf spark.executor.instances=2 \
> --conf spark.kubernetes.executor.request.cores=0.5 \
> --conf spark.kubernetes.container.image=docker.io/apache/spark:v3.4.0 \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
> --conf spark.kubernetes.driver.podTemplateFile=./driver.yml \
> --conf spark.kubernetes.executor.podTemplateFile=./executor.yml \
> local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar 30000 {code}
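
The setup steps in the quoted description could be scripted roughly as follows. This is only a sketch and was not part of the original report: the cluster name `yunikorn-repro`, the Helm release name and namespace, and the use of the released chart (the reporter tested against master) are all assumptions. By default the script only prints the commands; set `DRY_RUN=0` to actually run them.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction setup from the quoted description.
# Assumptions (not in the original report): the kind cluster name, the Helm
# release name/namespace, and installing from the released chart repository.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"   # default: print commands only; DRY_RUN=0 executes them

run() {
  # Print the command; execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

repro_setup() {
  # Step 1: new kind cluster. Limiting it to 4 CPUs / 8 GB depends on the
  # container runtime settings and is not configured here.
  run kind create cluster --name yunikorn-repro

  # Step 2: deploy YuniKorn using Helm.
  run helm repo add yunikorn https://apache.github.io/yunikorn-release
  run helm repo update
  run helm install yunikorn yunikorn/yunikorn --namespace yunikorn --create-namespace

  # Step 3: service account and role binding for Spark (as given in the report).
  run kubectl create serviceaccount spark
  run kubectl create clusterrolebinding spark-role --clusterrole=edit \
    --serviceaccount=default:spark --namespace=default

  # Step 4 onward is manual: kubectl proxy blocks, so run it separately.
  echo "# next: run 'kubectl proxy' in another terminal, then spark-submit twice"
}

repro_setup
```

After this, the two Spark applications are submitted with the spark-submit command from the description, and one executor of application 1 is deleted to trigger the stuck state.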