[ https://issues.apache.org/jira/browse/SPARK-46310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795320#comment-17795320 ]
Lucca Sergi commented on SPARK-46310:
-------------------------------------

Just saw [https://github.com/volcano-sh/volcano/issues/3250] - can someone please confirm that VolcanoFeatureStep doesn't support Spark in client mode? If that's the case, what would be the alternative for using Volcano with a client-mode deployment?

Cannot deploy Spark application using VolcanoFeatureStep to specify podGroupTemplate file
------------------------------------------------------------------------------------------

                 Key: SPARK-46310
                 URL: https://issues.apache.org/jira/browse/SPARK-46310
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.4.1
            Reporter: Lucca Sergi
            Priority: Major

I'm trying to deploy a Spark application (version 3.4.1) on Kubernetes using Volcano as the scheduler. I define a [VolcanoJob|https://volcano.sh/en/docs/vcjob/] that represents the Spark driver - it has only one task, whose pod specification includes the driver container, which invokes the spark-submit command.

Following the official Spark documentation ("[Using Volcano as Customized Scheduler for Spark on Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-volcano-as-customized-scheduler-for-spark-on-kubernetes]"), I set the configuration parameters needed to use Volcano as the scheduler for my Spark workload:

{code:java}
/opt/spark/bin/spark-submit --name "volcano-spark-1" --deploy-mode="client" \
  --class "org.apache.spark.examples.SparkPi" \
  --conf spark.executor.instances="1" \
  --conf spark.kubernetes.driver.pod.featureSteps="org.apache.spark.deploy.k8s.features.VolcanoFeatureStep" \
  --conf spark.kubernetes.executor.pod.featureSteps="org.apache.spark.deploy.k8s.features.VolcanoFeatureStep" \
  --conf spark.kubernetes.scheduler.volcano.podGroupTemplateFile="/var/template/podgroup.yaml" \
  file:///opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar
{code}

In the block above, I omitted some Kubernetes configuration parameters that aren't relevant to this example. The parameter *{{spark.kubernetes.scheduler.volcano.podGroupTemplateFile}}* points to a file mounted in the driver container, with content like the following (cpu/memory values may vary):

{code:yaml}
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: pod-group-test
spec:
  minResources:
    cpu: "2"
    memory: "2Gi"
  queue: some-existing-queue
{code}

I manually verified that "/var/template/podgroup.yaml" exists in the container before the "spark-submit" command is issued. I also granted all the necessary RBAC permissions so that the driver pod can interact with the relevant Kubernetes objects (pods, VolcanoJobs, podgroups, queues, etc.).
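For reference, the grant I used was roughly equivalent to the minimal Role sketched below. This is illustrative only: the name, namespace, and exact verb list are assumptions, not the literal manifest from my environment.

{code:yaml}
# Illustrative Role for the driver's service account, covering the objects
# listed above. Name, namespace, and verbs are assumptions for this sketch.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-driver-volcano   # hypothetical name
  namespace: spark-apps        # hypothetical namespace
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "delete", "deletecollection"]
  - apiGroups: ["batch.volcano.sh"]
    resources: ["jobs"]        # VolcanoJob
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: ["scheduling.volcano.sh"]
    resources: ["podgroups", "queues"]
    verbs: ["get", "list", "watch", "create", "delete"]
{code}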
When I execute this VolcanoJob, I see only the driver pod being created, and when I inspect its logs, I find the following error:

{code:java}
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://api.<masked-environment-endpoint>/api/v1/namespaces/04522055-15b3-40d8-ba07-22b1a2a5ffcc/pods. Message: admission webhook "validatepod.volcano.sh" denied the request: failed to get PodGroup for pod <04522055-15b3-40d8-ba07-22b1a2a5ffcc/volcano-spark-1-driver-0-exec-789>: podgroups.scheduling.volcano.sh "spark-5ad570e340934d3997065fa6d504910e-podgroup" not found. Received status: Status(apiVersion=v1, code=400, details=null, kind=Status, message=admission webhook "validatepod.volcano.sh" denied the request: failed to get PodGroup for pod <04522055-15b3-40d8-ba07-22b1a2a5ffcc/volcano-spark-1-driver-0-exec-789>: podgroups.scheduling.volcano.sh "spark-5ad570e340934d3997065fa6d504910e-podgroup" not found, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:538)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:558)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:349)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:711)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:93)
	at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:1113)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:93)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:440)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:417)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:370)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:363)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:363)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3(ExecutorPodsAllocator.scala:134)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3$adapted(ExecutorPodsAllocator.scala:134)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber$$processSnapshotsInternal(ExecutorPodsSnapshotsStoreImpl.scala:143)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.processSnapshots(ExecutorPodsSnapshotsStoreImpl.scala:131)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$addSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:85)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:182)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:296)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:838)
{code}
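The absence of that PodGroup can be double-checked directly against the cluster with something like the command below (the namespace and the expected PodGroup name are taken from the error message above):

{code:java}
# List Volcano PodGroups in the application's namespace; the one named in the
# error, "spark-5ad570e340934d3997065fa6d504910e-podgroup", should be absent.
kubectl get podgroups.scheduling.volcano.sh \
  -n 04522055-15b3-40d8-ba07-22b1a2a5ffcc
{code}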
The error seems to be triggered when the driver attempts to deploy the executors of my Spark application. The message says that the PodGroup "spark-5ad570e340934d3997065fa6d504910e-podgroup" cannot be found (as reported by the Volcano admission webhook).

I was expecting the driver and executors to be assigned to the same PodGroup object, created by the VolcanoFeatureStep from the template file I provided through the configuration parameter *{{spark.kubernetes.scheduler.volcano.podGroupTemplateFile}}*. With that, my Spark application would get proper batch scheduling: driver and executor pods would reside in the same pod group and be scheduled together by Volcano. Instead, only the driver pod is deployed, and the error above appears in its logs.

The documentation "[Using Volcano as Customized Scheduler for Spark on Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-volcano-as-customized-scheduler-for-spark-on-kubernetes]" leads me to understand that by providing the PodGroup template file, my Spark application (i.e., driver and executors) would be allocated to a single PodGroup object following the specification I provided. That doesn't seem to be the case: the PodGroup isn't created from the provided template, and the executors cannot be created either.

Some more details about the environment I used:
- Volcano version: v1.8.0
- Spark version: 3.4.1
- Kubernetes version: v1.26.7
- Cloud provider: GCP
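Regarding the client-mode question in the comment above, one possible direction (an untested sketch, not a confirmed solution) would be to bypass VolcanoFeatureStep's PodGroup creation entirely: pre-create a PodGroup yourself, then attach the executor pods to it via Volcano's {{scheduling.k8s.io/group-name}} annotation using Spark's per-pod annotation and scheduler-name configs. The namespace, PodGroup name, and queue below are illustrative:

{code:java}
# Untested sketch: pre-create the PodGroup instead of relying on
# VolcanoFeatureStep. All names here are illustrative.
kubectl apply -n <my-namespace> -f - <<'EOF'
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: spark-pi-podgroup        # hypothetical, manually chosen name
spec:
  minResources:
    cpu: "2"
    memory: "2Gi"
  queue: some-existing-queue
EOF

# Point the executors at the pre-created PodGroup and at the Volcano scheduler.
# (In client mode the driver runs inside the VolcanoJob's pod, so its group
# annotation and schedulerName would be set in the VolcanoJob pod template.)
/opt/spark/bin/spark-submit --name "volcano-spark-1" --deploy-mode="client" \
  --class "org.apache.spark.examples.SparkPi" \
  --conf spark.executor.instances="1" \
  --conf spark.kubernetes.executor.scheduler.name="volcano" \
  --conf spark.kubernetes.executor.annotation.scheduling.k8s.io/group-name="spark-pi-podgroup" \
  file:///opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar
{code}

Whether the Volcano admission webhook and scheduler accept executor pods attached to an externally created PodGroup this way would still need to be verified.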