[
https://issues.apache.org/jira/browse/SPARK-47556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun closed SPARK-47556.
---------------------------------
> [K8] Spark App ID collision resulting in deleting wrong resources
> -----------------------------------------------------------------
>
> Key: SPARK-47556
> URL: https://issues.apache.org/jira/browse/SPARK-47556
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.1
> Reporter: Sundeep K
> Priority: Major
>
> h3. Issue:
> We noticed that K8s executor pods sometimes go into a crash loop with
> 'Error: MountVolume.SetUp failed for volume "spark-conf-volume-exec"'. Upon
> investigation we found that two Spark jobs had launched with the same
> application ID, and when one of them finished first it deleted all of its
> resources, taking the other job's resources down with it.
> -> The Spark application ID is created by this
> [code|https://github.com/apache/spark/blob/36126a5c1821b4418afd5788963a939ea7f64078/core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala#L38]:
> "spark-application-" + System.currentTimeMillis
> This means that if two applications launch in the same millisecond, they end
> up with the same app ID.
> -> The
> [spark-app-selector|https://github.com/apache/spark/blob/93f98c0a61ddb66eb777c3940fbf29fc58e2d79b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L23]
> label is added to every resource created by the driver, and its value is the
> application ID. The Kubernetes scheduler backend deletes all resources with
> the same
> [label|https://github.com/apache/spark/blob/2a8bb5cdd3a5a2d63428b82df5e5066a805ce878/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L162C1-L172C6]
> upon termination.
> This deletes the config map and executor pods of the job that is still
> running; its driver tries to relaunch the executor pods, but the config map
> is gone, so the pods sit in a crash loop.
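> The two points above can be sketched in miniature with plain Python (the
> resource list and helper names are hypothetical stand-ins for illustration,
> not the actual Spark or Kubernetes APIs):
> {code:python}
> def make_app_id(now_ms):
>     # Mirrors "spark-application-" + System.currentTimeMillis
>     return f"spark-application-{now_ms}"
>
> # Two jobs launched in the same millisecond get identical IDs:
> job_a = make_app_id(1711500000000)
> job_b = make_app_id(1711500000000)
>
> # Hypothetical in-memory stand-in for the cluster's labeled resources
> resources = [
>     {"kind": "Pod", "name": "a-exec-1", "labels": {"spark-app-selector": job_a}},
>     {"kind": "ConfigMap", "name": "a-conf", "labels": {"spark-app-selector": job_a}},
>     {"kind": "Pod", "name": "b-exec-1", "labels": {"spark-app-selector": job_b}},
> ]
>
> def delete_app_resources(resources, app_id):
>     # Label-based cleanup: drop everything matching the finishing app's ID
>     return [r for r in resources if r["labels"].get("spark-app-selector") != app_id]
>
> # When job A finishes it also wipes job B's resources, since job_a == job_b
> remaining = delete_app_resources(resources, job_a)
> {code}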
> h3. Context
> We are using [Spark on
> Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html]
> and launch our Spark jobs using PySpark. We launch multiple Spark jobs
> within a given k8s namespace. Each Spark job can be launched from different
> pods or from different processes in a pod. Every time a job is launched it
> has a unique app name. Here is how a job is launched (omitting irrelevant
> details):
> {code:python}
> from pyspark.sql import SparkSession
>
> # spark_conf has the settings required for Spark on K8s
> builder = SparkSession.builder \
>     .config(conf=spark_conf) \
>     .appName('testapp') \
>     .master(f'k8s://{kubernetes_host}')
> session = builder.getOrCreate()
> with session:
>     session.sql('SELECT 1'){code}
> h3. Repro
> Set the same app ID in the Spark config and run two different jobs, one that
> finishes fast and one that runs slow. The slower job goes into a crash loop.
> {code:java}
> "spark.app.id": "<same Id for 2 spark job>"{code}
> h3. Workaround
> Set a unique spark.app.id for all the jobs that run on k8s,
> e.g.:
> {code:java}
> "spark.app.id": f'{AppName}-{CurrTimeInMilliSecs}-{UUId}'[:63]{code}
> h3. Fix
> Add a unique hash at the end of the application ID:
> [https://github.com/apache/spark/pull/45712]
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]