[ https://issues.apache.org/jira/browse/SPARK-36872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482861#comment-17482861 ]
Abhishek Rao commented on SPARK-36872:
--------------------------------------
[~shkhrgpt], could you please share more details on which script you are
referring to? We're facing similar issues and are looking for options to fix
this.
> Decommissioning executors get killed before transferring their data because
> of the hardcoded timeout of 60 secs
> ---------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-36872
> URL: https://issues.apache.org/jira/browse/SPARK-36872
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 3.1.1, 3.1.2, 3.2.0
> Reporter: Shekhar Gupta
> Priority: Trivial
>
> During the graceful decommissioning phase, executors need to transfer all of
> their shuffle and cache data to peer executors. However, they get killed
> before the transfer completes because of the hardcoded 60-second timeout in
> the decommissioning script. Because the executors die prematurely, Spark
> tasks on other executors fail, causing application failures that are hard to
> debug. To work around this, we ended up writing a custom script with a
> different timeout and rebuilding the Spark image, but we would prefer an
> easier solution that does not require rebuilding the image.
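For context, the hardcoded value lives in the decom.sh script bundled with the
stock Spark image, so the workaround described above amounts to shipping a
patched copy of that script. Below is a minimal sketch of such a patched
script, assuming the layout of the stock Spark 3.1/3.2 Kubernetes image (where
the script sits at /opt/decom.sh and blocks on "timeout 60"); the 600-second
value is only an example, and the contents should be verified against the
script shipped in the image actually in use.
{code:bash}
#!/usr/bin/env bash
# Sketch of a custom decom.sh with a longer grace period. Mirrors the stock
# script from the Spark 3.1/3.2 images except for the timeout value; verify
# against the script shipped in your image before relying on it.
set -ex
echo "Asked to decommission"

# Find the executor JVM pid and ask it to decommission by sending SIGPWR.
WORKER_PID=$(ps -o pid -C java | tail -n 1 | awk '{ sub(/^[ \t]+/, ""); print }')
echo "Using worker pid $WORKER_PID"
kill -s SIGPWR ${WORKER_PID}

# Block until the executor exits, for up to 600 seconds instead of the
# hardcoded 60. Size this to cover migrating the largest shuffle/cache
# footprint an executor may hold.
echo "Waiting for worker pid to exit"
timeout 600 tail --pid=${WORKER_PID} -f /dev/null
echo "Done"
{code}
Rebuilding the image is then a matter of copying this file over /opt/decom.sh
in a Dockerfile based on the stock image. Note that if the script runs as a
Kubernetes preStop hook, the pod's terminationGracePeriodSeconds caps the
total shutdown time, so it would need to be raised in step with the script's
timeout.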