[ https://issues.apache.org/jira/browse/SPARK-47475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-47475:
-----------------------------------
    Labels: pull-request-available  (was: )

> Jar Download Under K8s Cluster Mode Causes Executors Scaling Issues
> --------------------------------------------------------------------
>
>                 Key: SPARK-47475
>                 URL: https://issues.apache.org/jira/browse/SPARK-47475
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, Kubernetes, Spark Core
>    Affects Versions: 3.4.0, 3.5.0
>            Reporter: Jiale Tan
>            Priority: Major
>              Labels: pull-request-available
>
> {*}Context{*}:
> To submit Spark jobs to Kubernetes in cluster mode, {{spark-submit}} is triggered twice.
> The first time, {{SparkSubmit}} runs in k8s cluster mode: it appends the primary resource to {{spark.jars}} and calls {{KubernetesClientApplication::start}} to create a driver pod.
> The driver pod then runs {{spark-submit}} again with the same primary resource jar. This time, however, {{SparkSubmit}} runs in client mode with {{spark.kubernetes.submitInDriver}} set to {{true}} and the updated {{spark.jars}}. In this mode, {{SparkSubmit}} downloads all the jars in {{spark.jars}} to the driver, and those {{spark.jars}} URLs are replaced by the driver-local paths.
> {{SparkSubmit}} then appends the same primary resource to {{spark.jars}} again. As a result, {{spark.jars}} holds two duplicate copies of the primary resource: one with the original URL the user submitted, the other with the driver-local file path.
> Later, when the driver starts the SparkContext, it copies all the {{spark.jars}} entries to {{spark.app.initial.jar.urls}} and replaces the driver-local jar paths in {{spark.app.initial.jar.urls}} with driver file-service paths.
> At this point, every jar passed via {{--jars}} or {{spark.jars}} in the original user submission has been replaced with a driver file-service URL and added to {{spark.app.initial.jar.urls}}.
> The primary resource jar from the original submission, however, shows up in {{spark.app.initial.jar.urls}} twice: once with the original path from the user submission, and once with a driver file-service URL.
> When executors start, they download all the jars in {{spark.app.initial.jar.urls}}.
> {*}Issues{*}:
> - Each executor downloads two duplicate copies of the primary resource, one from the original URL the user submitted and one from the driver-local file path, which wastes resources. This was also reported previously [here|https://github.com/apache/spark/pull/37417#issuecomment-1517797912].
> - When the jars are big and the application requests many executors, the massive concurrent jar download from the driver saturates the network. The executors' jar downloads then time out, causing the executors to be terminated. From the user's point of view, the application is trapped in a loop of massive executor loss and re-provisioning and never gets as many live executors as requested, which leads to job SLA breaches or sometimes outright job failure.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
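The double-append described in the issue body can be modeled in a short sketch. This is a simplified illustration, not actual SparkSubmit code: the function names, example URLs, and the download directory are all hypothetical, and the final de-duplication-by-file-name step is just one possible mitigation, not necessarily the change made in the linked pull request.

```python
# Illustrative model of how spark.jars accumulates a duplicate copy of the
# primary resource across the two spark-submit passes. Names and URLs here
# are hypothetical, not SparkSubmit internals.

def append_primary_resource(jars, primary):
    """Each spark-submit pass appends the primary resource to spark.jars."""
    return jars + [primary]

def simulate_submits(user_jars, primary_url):
    """Simulate the two passes: cluster-mode submit, driver-side jar
    localization, then the client-mode submit inside the driver pod."""
    # Pass 1 (cluster mode): the primary resource URL is appended.
    after_cluster_submit = append_primary_resource(user_jars, primary_url)
    # In the driver pod, remote jars are downloaded and their URLs are
    # rewritten to driver-local paths.
    localized = ["/tmp/spark-downloads/" + url.rsplit("/", 1)[-1]
                 for url in after_cluster_submit]
    # Pass 2 (client mode, spark.kubernetes.submitInDriver=true): the same
    # primary resource is appended again, alongside its driver-local copy.
    return append_primary_resource(localized, primary_url)

def dedup_by_file_name(urls):
    """One possible mitigation (illustrative only): de-duplicate by jar file
    name so the original URL and the driver-local copy count as one jar."""
    seen, deduped = set(), []
    for url in urls:
        name = url.rsplit("/", 1)[-1]
        if name not in seen:
            seen.add(name)
            deduped.append(url)
    return deduped

jars = simulate_submits(["s3a://bucket/dep.jar"], "s3a://bucket/app.jar")
print(jars)
# app.jar appears twice: the driver-local path plus the original URL.
print(sum(j.endswith("app.jar") for j in jars))                      # 2
print(sum(j.endswith("app.jar") for j in dedup_by_file_name(jars)))  # 1
```

Running the sketch shows the final jar list carrying both `/tmp/spark-downloads/app.jar` and `s3a://bucket/app.jar`, which is exactly the state that makes every executor fetch the primary resource twice.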