[jira] [Created] (SPARK-36872) Decommissioning executors get killed before transferring their data because of the hardcoded timeout of 60 secs

Shekhar Gupta (Jira) Mon, 27 Sep 2021 19:42:04 -0700

Shekhar Gupta created SPARK-36872:
-------------------------------------

             Summary: Decommissioning executors get killed before transferring 
their data because of the hardcoded timeout of 60 secs
                 Key: SPARK-36872
                 URL: https://issues.apache.org/jira/browse/SPARK-36872
             Project: Spark
          Issue Type: Improvement
          Components: Kubernetes
    Affects Versions: 3.1.2, 3.1.1, 3.2.0
            Reporter: Shekhar Gupta



During the graceful decommissioning phase, executors need to transfer all of 
their shuffle and cache data to peer executors. However, they get killed 
before they can transfer all of the data because of the hardcoded timeout of 
60 seconds in the decommissioning script. Because the executors die 
prematurely, Spark tasks on other executors fail, which causes application 
failures that are hard to debug. To work around the issue, we wrote a custom 
script with a different timeout and rebuilt the Spark image, but we would 
prefer an easier solution that does not require rebuilding the image. 
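
To illustrate the kind of change being requested, here is a minimal sketch in Python of a wait step whose timeout is read from the environment instead of being hardcoded. It is not the actual decommissioning script. The environment variable name DECOM_WAIT_TIMEOUT_SECS and the helper names resolve_timeout and wait_for_exit are hypothetical and are not existing Spark settings or APIs. The 60 second default mirrors the value described above.

```python
import os
import subprocess


def resolve_timeout(env=os.environ, default=60.0):
    """Return the wait timeout in seconds.

    Reads DECOM_WAIT_TIMEOUT_SECS (a hypothetical name, not a real Spark
    setting) and falls back to the default when it is unset or invalid.
    """
    raw = env.get("DECOM_WAIT_TIMEOUT_SECS")
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    # Reject zero, negative, and NaN values.
    return value if value > 0 else default


def wait_for_exit(proc, timeout_secs):
    """Wait for a subprocess.Popen to exit.

    Returns True if the process exited within timeout_secs, False otherwise.
    The process is not killed here; that decision is left to the caller.
    """
    try:
        proc.wait(timeout=timeout_secs)
        return True
    except subprocess.TimeoutExpired:
        return False
```

With something along these lines, the wait could be extended by setting the variable in the executor pod spec (for example to 300) rather than by rebuilding the image.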



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

