[ https://issues.apache.org/jira/browse/SPARK-36872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shekhar Gupta updated SPARK-36872:
----------------------------------
    Description: During the graceful decommissioning phase, executors need to 
transfer all of their shuffle and cache data to peer executors. However, they 
are killed before transferring all the data because of the hardcoded 60-second 
timeout in the decommissioning script. When executors die prematurely, Spark 
tasks on other executors fail, causing application failures that are hard to 
debug. To work around the issue, we ended up writing a custom script with a 
different timeout and rebuilding the Spark image, but we would prefer a 
solution that does not require rebuilding the image.  (was: During the graceful 
decommissioning phase, executors need to transfer all of their shuffle and 
cache data to the peer executors. However, they get killed before could 
transfer all the data because of the hardcoded timeout value of 60 secs in the 
decommissioning script. As a result of executors dying prematurely, the spark 
tasks on other executors fail which causes application failures, and it is hard 
to debug those failures. To fix the issue, we ended up writing a custom script 
with a different timeout and rebuilt the spark image but we would prefer an 
easier solution that does not require rebuilding the image. )
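
For reference, a minimal sketch of the kind of custom script we used, with the 
timeout read from an environment variable. The variable name 
DECOMMISSION_TIMEOUT and the exact commands below are illustrative assumptions 
(modeled on the structure of the stock decommissioning script), not the actual 
script in our image:

    #!/usr/bin/env bash
    # Hypothetical custom decommissioning script: same flow as the stock
    # script, but the wait time comes from an env var instead of a
    # hardcoded 60 seconds.
    TIMEOUT="${DECOMMISSION_TIMEOUT:-60}"
    echo "Asked to decommission; waiting up to ${TIMEOUT}s for data migration"
    # Signal the executor JVM to begin decommissioning.
    WORKER_PID=$(ps -o pid -C java | tail -n 1 | awk '{ print $1 }')
    kill -s SIGPWR "${WORKER_PID}"
    # Block until the executor exits or the timeout expires.
    timeout "${TIMEOUT}" tail --pid="${WORKER_PID}" -f /dev/null
    echo "Done"

With a script like this, changing the timeout only requires setting an 
environment variable on the executor pod template instead of rebuilding the 
image.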

> Decommissioning executors get killed before transferring their data because 
> of the hardcoded timeout of 60 secs
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-36872
>                 URL: https://issues.apache.org/jira/browse/SPARK-36872
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.1.1, 3.1.2, 3.2.0
>            Reporter: Shekhar Gupta
>            Priority: Trivial
>
> During the graceful decommissioning phase, executors need to transfer all of 
> their shuffle and cache data to peer executors. However, they are killed 
> before transferring all the data because of the hardcoded 60-second timeout 
> in the decommissioning script. When executors die prematurely, Spark tasks on 
> other executors fail, causing application failures that are hard to debug. To 
> work around the issue, we ended up writing a custom script with a different 
> timeout and rebuilding the Spark image, but we would prefer a solution that 
> does not require rebuilding the image. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
