[ 
https://issues.apache.org/jira/browse/SPARK-55974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-55974:
------------------------------
    Description: 
In YARN mode, executors can get stuck during launch (e.g., slow node, resource 
contention, network issues). Without a timeout, the AM keeps waiting 
indefinitely, which can:
 * Block progress when executors never register.
 * Prevent new executors from being requested.
 * Cause jobs to hang or run with fewer executors than expected.

This change adds a configurable timeout so the AM can detect stuck launches and 
request replacement executors, improving reliability and resource utilization.
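
The detection step described above can be sketched as follows. This is only an 
illustrative model in Python, not the code from the pull request: the class 
LaunchTracker, its methods, and the launch_timeout_s setting are hypothetical 
names, and no real Spark configuration key is implied. It records when each 
executor launch starts, forgets executors that register, and reports those 
whose launch has exceeded the timeout so replacements can be requested.

```python
import time


class LaunchTracker:
    """Tracks executors that were launched but have not registered yet.

    Hypothetical sketch: names and the timeout setting are assumptions,
    not taken from the Spark code base.
    """

    def __init__(self, launch_timeout_s, clock=time.monotonic):
        self.launch_timeout_s = launch_timeout_s  # assumed configurable timeout
        self.clock = clock                        # injectable for testing
        self._launching = {}                      # executor id -> launch start time

    def on_launch(self, executor_id):
        # Record when the launch of this executor was started.
        self._launching[executor_id] = self.clock()

    def on_registered(self, executor_id):
        # The executor registered, so it is no longer a stuck-launch candidate.
        self._launching.pop(executor_id, None)

    def expire_stuck(self):
        # Return (and stop tracking) executors whose launch exceeded the timeout.
        now = self.clock()
        stuck = sorted(
            executor_id
            for executor_id, started in self._launching.items()
            if now - started > self.launch_timeout_s
        )
        for executor_id in stuck:
            del self._launching[executor_id]
        return stuck


# Usage with a fake clock so the example is deterministic.
now = [0.0]
tracker = LaunchTracker(launch_timeout_s=120.0, clock=lambda: now[0])
tracker.on_launch("exec-1")
tracker.on_launch("exec-2")

now[0] = 60.0
tracker.on_registered("exec-1")   # exec-1 came up in time

now[0] = 121.0
stuck = tracker.expire_stuck()    # exec-2 never registered
print(stuck)
```

In such a design, the caller would treat each returned id as a failed launch 
and request one replacement executor per entry; dropping the id from the 
tracker keeps the same launch from being counted twice.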

> Relaunch new executors if executor launch takes too long
> --------------------------------------------------------
>
>                 Key: SPARK-55974
>                 URL: https://issues.apache.org/jira/browse/SPARK-55974
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, YARN
>    Affects Versions: 3.2.4, 3.5.8, 4.1.1
>            Reporter: angerszhu
>            Priority: Major
>              Labels: pull-request-available


