[ https://issues.apache.org/jira/browse/SPARK-46310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795320#comment-17795320 ]
Lucca Sergi commented on SPARK-46310:
-------------------------------------

Just saw [https://github.com/volcano-sh/volcano/issues/3250] - can someone please confirm that VolcanoFeatureStep doesn't support Spark in client mode? If that's the case, what would be the alternative for using Volcano with a client-mode deployment?

Cannot deploy Spark application using VolcanoFeatureStep to specify podGroupTemplate file
------------------------------------------------------------------------------------------

                 Key: SPARK-46310
                 URL: https://issues.apache.org/jira/browse/SPARK-46310
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.4.1
            Reporter: Lucca Sergi
            Priority: Major

I'm trying to deploy a Spark application (version 3.4.1) on Kubernetes using Volcano as the scheduler. I define a [VolcanoJob|https://volcano.sh/en/docs/vcjob/] that represents the Spark driver - it has only one task, whose pod specification includes the driver container, which invokes the spark-submit command.

Following the official Spark documentation ("[Using Volcano as Customized Scheduler for Spark on Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-volcano-as-customized-scheduler-for-spark-on-kubernetes]"), I set the configuration parameters needed to use Volcano as the scheduler for my Spark workload:

{code:java}
/opt/spark/bin/spark-submit --name "volcano-spark-1" --deploy-mode="client" \
  --class "org.apache.spark.examples.SparkPi" \
  --conf spark.executor.instances="1" \
  --conf spark.kubernetes.driver.pod.featureSteps="org.apache.spark.deploy.k8s.features.VolcanoFeatureStep" \
  --conf spark.kubernetes.executor.pod.featureSteps="org.apache.spark.deploy.k8s.features.VolcanoFeatureStep" \
  --conf spark.kubernetes.scheduler.volcano.podGroupTemplateFile="/var/template/podgroup.yaml" \
  file:///opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar
{code}

In the block above, I omitted some Kubernetes configuration parameters that aren't relevant to this example. The parameter *{{spark.kubernetes.scheduler.volcano.podGroupTemplateFile}}* points to a file mounted in the driver container, with content like the following (cpu/memory values may vary):

{code:yaml}
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: pod-group-test
spec:
  minResources:
    cpu: "2"
    memory: "2Gi"
  queue: some-existing-queue
{code}

I manually verified that "/var/template/podgroup.yaml" exists in the container before the "spark-submit" command is issued. I also granted all the necessary RBAC permissions so that the driver pod can interact with the relevant Kubernetes objects (pods, VolcanoJobs, podgroups, queues, etc.).
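For reference, the grant I used was roughly equivalent to the minimal Role sketched below. This is illustrative only: the name, namespace, and exact verb list are assumptions, not the literal manifest from my environment.

{code:yaml}
# Illustrative Role for the driver's service account, covering the objects
# listed above. Name, namespace, and verbs are assumptions for this sketch.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-driver-volcano   # hypothetical name
  namespace: spark-apps        # hypothetical namespace
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "delete", "deletecollection"]
  - apiGroups: ["batch.volcano.sh"]
    resources: ["jobs"]        # VolcanoJob
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: ["scheduling.volcano.sh"]
    resources: ["podgroups", "queues"]
    verbs: ["get", "list", "watch", "create", "delete"]
{code}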
When I execute this VolcanoJob, I see only the driver pod being created, and when I inspect its logs, I find the following error:

{code:java}
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://api.<masked-environment-endpoint>/api/v1/namespaces/04522055-15b3-40d8-ba07-22b1a2a5ffcc/pods. Message: admission webhook "validatepod.volcano.sh" denied the request: failed to get PodGroup for pod <04522055-15b3-40d8-ba07-22b1a2a5ffcc/volcano-spark-1-driver-0-exec-789>: podgroups.scheduling.volcano.sh "spark-5ad570e340934d3997065fa6d504910e-podgroup" not found. Received status: Status(apiVersion=v1, code=400, details=null, kind=Status, message=admission webhook "validatepod.volcano.sh" denied the request: failed to get PodGroup for pod <04522055-15b3-40d8-ba07-22b1a2a5ffcc/volcano-spark-1-driver-0-exec-789>: podgroups.scheduling.volcano.sh "spark-5ad570e340934d3997065fa6d504910e-podgroup" not found, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:538)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:558)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:349)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:711)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:93)
	at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:1113)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:93)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:440)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:417)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:370)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:363)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:363)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3(ExecutorPodsAllocator.scala:134)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3$adapted(ExecutorPodsAllocator.scala:134)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber$$processSnapshotsInternal(ExecutorPodsSnapshotsStoreImpl.scala:143)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.processSnapshots(ExecutorPodsSnapshotsStoreImpl.scala:131)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$addSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:85)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:182)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:296)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:838)
{code}
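The absence of that PodGroup can be double-checked directly against the cluster with something like the command below (the namespace and the expected PodGroup name are taken from the error message above):

{code:java}
# List Volcano PodGroups in the application's namespace; the one named in the
# error, "spark-5ad570e340934d3997065fa6d504910e-podgroup", should be absent.
kubectl get podgroups.scheduling.volcano.sh \
  -n 04522055-15b3-40d8-ba07-22b1a2a5ffcc
{code}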
The error seems to be triggered when the driver attempts to deploy the executors of my Spark application. The message says that the PodGroup "spark-5ad570e340934d3997065fa6d504910e-podgroup" cannot be found (as reported by the Volcano admission webhook).

I was expecting the driver and executors to be assigned to the same PodGroup object, created by the VolcanoFeatureStep from the template file I provided through the configuration parameter *{{spark.kubernetes.scheduler.volcano.podGroupTemplateFile}}*. With that, my Spark application would get proper batch scheduling: driver and executor pods would reside in the same pod group and be scheduled together by Volcano. Instead, only the driver pod is deployed, and the error above appears in its logs.

The documentation "[Using Volcano as Customized Scheduler for Spark on Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-volcano-as-customized-scheduler-for-spark-on-kubernetes]" leads me to understand that by providing the PodGroup template file, my Spark application (i.e., driver and executors) would be allocated to a single PodGroup object following the specification I provided. That doesn't seem to be the case: the PodGroup isn't created from the provided template, and the executors cannot be created either.

Some more details about the environment I used:
- Volcano version: v1.8.0
- Spark version: 3.4.1
- Kubernetes version: v1.26.7
- Cloud provider: GCP
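Regarding the client-mode question in the comment above, one possible direction (an untested sketch, not a confirmed solution) would be to bypass VolcanoFeatureStep's PodGroup creation entirely: pre-create a PodGroup yourself, then attach the executor pods to it via Volcano's {{scheduling.k8s.io/group-name}} annotation using Spark's per-pod annotation and scheduler-name configs. The namespace, PodGroup name, and queue below are illustrative:

{code:java}
# Untested sketch: pre-create the PodGroup instead of relying on
# VolcanoFeatureStep. All names here are illustrative.
kubectl apply -n <my-namespace> -f - <<'EOF'
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: spark-pi-podgroup        # hypothetical, manually chosen name
spec:
  minResources:
    cpu: "2"
    memory: "2Gi"
  queue: some-existing-queue
EOF

# Point the executors at the pre-created PodGroup and at the Volcano scheduler.
# (In client mode the driver runs inside the VolcanoJob's pod, so its group
# annotation and schedulerName would be set in the VolcanoJob pod template.)
/opt/spark/bin/spark-submit --name "volcano-spark-1" --deploy-mode="client" \
  --class "org.apache.spark.examples.SparkPi" \
  --conf spark.executor.instances="1" \
  --conf spark.kubernetes.executor.scheduler.name="volcano" \
  --conf spark.kubernetes.executor.annotation.scheduling.k8s.io/group-name="spark-pi-podgroup" \
  file:///opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar
{code}

Whether the Volcano admission webhook and scheduler accept executor pods attached to an externally created PodGroup this way would still need to be verified.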