[
https://issues.apache.org/jira/browse/SPARK-47475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jiale Tan updated SPARK-47475:
------------------------------
Component/s: Spark Core
> Jar Download Under K8s Cluster Mode Causes Executors Scaling Issues
> --------------------------------------------------------------------
>
> Key: SPARK-47475
> URL: https://issues.apache.org/jira/browse/SPARK-47475
> Project: Spark
> Issue Type: Bug
> Components: Deploy, Kubernetes, Spark Core
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Jiale Tan
> Priority: Major
>
> {*}Context{*}:
> To submit a Spark job to Kubernetes in cluster mode, {{spark-submit}}
> runs twice.
> The first run executes {{SparkSubmit}} in k8s cluster mode: it appends the
> primary resource to {{spark.jars}} and calls
> {{KubernetesClientApplication::start}} to create the driver pod.
> The driver pod then runs {{spark-submit}} again with the same primary resource
> jar. This time, however, {{SparkSubmit}} runs in client mode with
> {{spark.kubernetes.submitInDriver}} set to {{true}} and with the updated
> {{spark.jars}}. In this mode, {{SparkSubmit}} downloads every jar in
> {{spark.jars}} to the driver and replaces the {{spark.jars}} URLs with the
> driver-local paths.
> {{SparkSubmit}} then appends the same primary resource to {{spark.jars}} again.
> As a result, {{spark.jars}} holds two entries for the primary resource: one
> with the original URL the user submitted, the other with the driver-local
> file path.
> Later, when the driver starts the SparkContext, it copies all of
> {{spark.jars}} to {{spark.app.initial.jar.urls}} and replaces the
> driver-local jar paths in {{spark.app.initial.jar.urls}} with driver file
> service URLs.
> At this point, every jar passed through {{--jars}} or {{spark.jars}} in the
> original user submission has been replaced with a driver file service URL in
> {{spark.app.initial.jar.urls}}, and the primary resource jar from the original
> submission appears in {{spark.app.initial.jar.urls}} twice: once with the
> original path from the user submission and once with a driver file service
> URL.
> When executors start, they download every jar listed in
> {{spark.app.initial.jar.urls}}.
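
The sequence described in the Context section can be traced with a small toy model. This is only an illustrative sketch, not Spark code: the function names, the bucket URLs, the driver work directory, and the file service address are all hypothetical stand-ins for the steps described in the report.

```python
# Toy model of how the primary resource ends up in the jar list twice.
# Every name, path, and URL below is hypothetical; none of this is Spark's API.

def submit_cluster_mode(primary, jars):
    """First spark-submit run: append the primary resource to spark.jars."""
    return jars + [primary]

def submit_in_driver(primary, jars, local_dir="/opt/spark/work-dir"):
    """Second run inside the driver pod (client mode, submitInDriver=true)."""
    # Download every jar and replace its URL with a driver-local path.
    localized = [f"file:{local_dir}/{url.rsplit('/', 1)[-1]}" for url in jars]
    # Append the primary resource again, still under its original URL.
    return localized + [primary]

def initial_jar_urls(jars, file_server="spark://driver:7078/jars"):
    """SparkContext start: driver-local paths become file service URLs."""
    return [
        f"{file_server}/{url.rsplit('/', 1)[-1]}" if url.startswith("file:") else url
        for url in jars
    ]

primary = "s3a://bucket/app.jar"
extra_jars = ["s3a://bucket/dep.jar"]

spark_jars = submit_cluster_mode(primary, extra_jars)
spark_jars = submit_in_driver(primary, spark_jars)
urls = initial_jar_urls(spark_jars)

for url in urls:
    print(url)
```

In this model the final list contains `app.jar` twice, once under the driver file service address and once under the original `s3a://` URL, which matches the duplication described above.
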
> {*}Issues{*}:
> - Each executor downloads two copies of the primary resource, one from the
> original URL the user submitted and one from the driver file service URL,
> which wastes resources.
> - When jars are large and the application requests many executors, the
> large number of concurrent jar downloads from the driver saturates the
> network. The executors' jar downloads then time out, and the executors are
> terminated. From the user's point of view, the application is stuck in a
> loop of mass executor loss and re-provisioning and never reaches the
> requested number of live executors, which leads to job SLA breaches or
> sometimes job failure.
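
A rough back-of-the-envelope estimate shows why the second issue gets worse with jar size and executor count. The sketch below assumes, as the report describes, that the driver file service serves one copy of every jar to every executor, and that each executor additionally fetches the primary resource from its original URL. The jar sizes, executor count, and driver bandwidth are made-up illustration values, not measurements.

```python
MIB = 1024 ** 2

def per_executor_bytes(primary, deps):
    """Bytes one executor downloads: primary resource twice, each dependency once."""
    return 2 * primary + sum(deps)

def driver_egress_seconds(num_executors, primary, deps, driver_bytes_per_sec):
    """Lower bound on the time the driver needs to serve all executors.

    Assumes the driver serves one copy of the primary resource and of each
    dependency jar to every executor, with its link fully utilized.
    """
    total_bytes = num_executors * (primary + sum(deps))
    return total_bytes / driver_bytes_per_sec

# Hypothetical workload: 500 MiB primary jar, two dependency jars,
# 200 executors, and a driver link of 1250 MiB/s.
primary = 500 * MIB
deps = [300 * MIB, 200 * MIB]

wasted_per_executor = per_executor_bytes(primary, deps) - (primary + sum(deps))
seconds = driver_egress_seconds(200, primary, deps, 1250 * MIB)

print(f"duplicate bytes per executor: {wasted_per_executor // MIB} MiB")
print(f"minimum time to serve all executors: {seconds:.0f} s")
```

With these assumed numbers the driver needs at least 160 seconds to serve everyone, so any executor whose fetch timeout is shorter than its share of that window would be lost and re-provisioned, which in turn adds more download requests.
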