Lucca Sergi created SPARK-46310:
-----------------------------------

             Summary: Cannot deploy Spark application using VolcanoFeatureStep to specify podGroupTemplate file
                 Key: SPARK-46310
                 URL: https://issues.apache.org/jira/browse/SPARK-46310
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.4.1
            Reporter: Lucca Sergi


I'm trying to deploy a Spark application (version 3.4.1) on Kubernetes using Volcano as the scheduler. I define a VolcanoJob that represents the Spark driver: it has a single task whose pod specification includes the driver container, which invokes the {{spark-submit}} command. A minimal sketch of this setup follows.
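
For reference, this is roughly the shape of the VolcanoJob I use (a sketch, not my actual manifest; the image, service account, and queue values here are placeholders):
{code:bash}
# Hypothetical VolcanoJob wrapping the Spark driver; only the overall
# structure matters here. Image, serviceAccountName, and queue are
# placeholder values, not taken from the real deployment.
kubectl apply -f - <<'EOF'
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: volcano-spark-1
spec:
  schedulerName: volcano
  minAvailable: 1
  queue: some-existing-queue
  tasks:
    - name: driver
      replicas: 1
      template:
        spec:
          serviceAccountName: spark
          restartPolicy: Never
          containers:
            - name: driver
              image: apache/spark:3.4.1
              # Invokes the spark-submit command shown further below
              command: ["/bin/bash", "-c", "/opt/spark/bin/spark-submit ..."]
EOF
{code}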

Following the official Spark documentation ([Using Volcano as Customized Scheduler for Spark on Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-volcano-as-customized-scheduler-for-spark-on-kubernetes]), I define the configuration parameters needed to use Volcano as the scheduler for my Spark workload:
{code:java}
/opt/spark/bin/spark-submit --name "volcano-spark-1" --deploy-mode="client" \
  --class "org.apache.spark.examples.SparkPi" \
  --conf spark.executor.instances="1" \
  --conf spark.kubernetes.driver.pod.featureSteps="org.apache.spark.deploy.k8s.features.VolcanoFeatureStep" \
  --conf spark.kubernetes.executor.pod.featureSteps="org.apache.spark.deploy.k8s.features.VolcanoFeatureStep" \
  --conf spark.kubernetes.scheduler.volcano.podGroupTemplateFile="/var/template/podgroup.yaml" \
  file:///opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar
{code}
In the block above, I omitted some Kubernetes configuration parameters that aren't relevant to this example. The parameter *{{spark.kubernetes.scheduler.volcano.podGroupTemplateFile}}* points to a file mounted in the driver container, with the following content:
{code:yaml}
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: pod-group-test
spec:
  minResources:
    cpu: "2"
    memory: "2Gi"
  queue: some-existing-queue
{code}
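The queue referenced by the template already exists in the cluster. For completeness, a minimal sketch of how such a Volcano queue can be created (the weight value is a placeholder):
{code:bash}
# Minimal Volcano Queue matching the name referenced in the template;
# weight: 1 is a placeholder value
kubectl apply -f - <<'EOF'
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: some-existing-queue
spec:
  weight: 1
EOF
{code}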
I manually verified that the file {{/var/template/podgroup.yaml}} exists in the container before the {{spark-submit}} command is issued. I also granted all the necessary RBAC permissions so that the driver pod can interact with the relevant Kubernetes objects (pods, VolcanoJobs, PodGroups, Queues, etc.). Concretely, the checks I ran look like the sketch below.
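
These are roughly the pre-flight checks (the namespace and service account names here are placeholders):
{code:bash}
# Confirm the template is mounted before spark-submit runs
test -f /var/template/podgroup.yaml && echo "template present"

# Confirm the driver's service account can manage the relevant objects
kubectl auth can-i create pods \
  --as=system:serviceaccount:my-namespace:spark -n my-namespace
kubectl auth can-i create podgroups.scheduling.volcano.sh \
  --as=system:serviceaccount:my-namespace:spark -n my-namespace
# Volcano queues are cluster-scoped, so no namespace flag here
kubectl auth can-i get queues.scheduling.volcano.sh \
  --as=system:serviceaccount:my-namespace:spark
{code}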

When I execute this VolcanoJob, only the driver pod is created, and inspecting its logs shows the following error:
{code:java}
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://api.<masked-environment-endpoint>/api/v1/namespaces/04522055-15b3-40d8-ba07-22b1a2a5ffcc/pods. Message: admission webhook "validatepod.volcano.sh" denied the request: failed to get PodGroup for pod <04522055-15b3-40d8-ba07-22b1a2a5ffcc/volcano-spark-1-driver-0-exec-789>: podgroups.scheduling.volcano.sh "spark-5ad570e340934d3997065fa6d504910e-podgroup" not found. Received status: Status(apiVersion=v1, code=400, details=null, kind=Status, message=admission webhook "validatepod.volcano.sh" denied the request: failed to get PodGroup for pod <04522055-15b3-40d8-ba07-22b1a2a5ffcc/volcano-spark-1-driver-0-exec-789>: podgroups.scheduling.volcano.sh "spark-5ad570e340934d3997065fa6d504910e-podgroup" not found, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
        at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
        at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:538)
        at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:558)
        at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:349)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:711)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:93)
        at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:1113)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:93)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:440)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:417)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:370)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:363)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:363)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3(ExecutorPodsAllocator.scala:134)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3$adapted(ExecutorPodsAllocator.scala:134)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber$$processSnapshotsInternal(ExecutorPodsSnapshotsStoreImpl.scala:143)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.processSnapshots(ExecutorPodsSnapshotsStoreImpl.scala:131)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$addSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:85)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:182)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:296)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:838)
{code}
The error seems to be triggered when the driver attempts to deploy the executors of my Spark application. The message states that the PodGroup "spark-5ad570e340934d3997065fa6d504910e-podgroup" cannot be found (as reported by the Volcano admission webhook).
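
To see the mismatch, one can compare the PodGroups that actually exist in the namespace with the group each pod is annotated with (Volcano associates a pod with its PodGroup via the {{scheduling.k8s.io/group-name}} annotation):
{code:bash}
# List the PodGroups that actually exist in the namespace;
# <namespace> is a placeholder for the real namespace
kubectl get podgroups.scheduling.volcano.sh -n <namespace>

# Show which pod group the driver pod was annotated with
# (<driver-pod> is a placeholder for the actual driver pod name)
kubectl get pod <driver-pod> -n <namespace> -o yaml | grep group-name
{code}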

I was expecting the driver and executors to be assigned to the same PodGroup object, created by the VolcanoFeatureStep from the template file I provided through {{spark.kubernetes.scheduler.volcano.podGroupTemplateFile}}. That would give me proper batch scheduling of my Spark application: driver and executor pods would reside in the same pod group and be scheduled together by Volcano. Instead, only the driver pod is deployed, and the error above appears in its logs; the executor name in the error suggests the executors reference a generated pod group name of the form {{spark-<appId>-podgroup}} that was never created, as the sketch below illustrates.
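
A sketch of how I would verify the expected state, i.e. one PodGroup built from my template with every pod of the application pointing at it:
{code:bash}
# Expected: a single PodGroup derived from the template, and both driver
# and executor pods carrying its name in their group annotation
kubectl get podgroups.scheduling.volcano.sh -n <namespace>
kubectl get pods -n <namespace> \
  -o custom-columns='NAME:.metadata.name,PODGROUP:.metadata.annotations.scheduling\.k8s\.io/group-name'
{code}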

The documentation ([Using Volcano as Customized Scheduler for Spark on Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-volcano-as-customized-scheduler-for-spark-on-kubernetes]) leads me to understand that, by providing the PodGroup template file, my Spark application (i.e., driver and executors) would be allocated to a single PodGroup object following the specification I provided. That doesn't seem to be the case: the PodGroup isn't created from the provided template, and the executors cannot be created either.

Some more details about the environment I used:
 - Volcano Version: v1.8.0
 - Spark Version: 3.4.1
 - Kubernetes version: v1.26.7
 - Cloud provider: GCP


