[jira] [Resolved] (SPARK-43407) Can executors recover/reuse shuffle files upon failure?

Hyukjin Kwon (Jira) Sun, 14 May 2023 18:56:05 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-43407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hyukjin Kwon resolved SPARK-43407.
----------------------------------
    Resolution: Invalid

Please ask questions on the Spark user mailing list 
(https://spark.apache.org/community.html). You are likely to get a better answer 
there.

> Can executors recover/reuse shuffle files upon failure?
> -------------------------------------------------------
>
>                 Key: SPARK-43407
>                 URL: https://issues.apache.org/jira/browse/SPARK-43407
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 3.3.1
>            Reporter: Faiz Halde
>            Priority: Minor
>
> Hello,
> We've been in touch with a few Spark specialists who suggested a potential 
> solution to improve the reliability of our shuffle-heavy jobs.
> Here is what our setup looks like:
>  * Spark version: 3.3.1
>  * Java version: 1.8
>  * We do not use the external shuffle service
>  * We use spot instances
> We run Spark jobs on clusters that use Amazon EBS volumes. The 
> spark.local.dir is mounted on this EBS volume. One of the offerings from the 
> service we use is EBS migration, which means that if a host is about to 
> be evicted, a new host is created and the EBS volume is attached to it.
> The suggestion is that when Spark assigns a new executor to the newly created 
> instance, it can recover all the shuffle files that are already persisted on 
> the migrated EBS volume.
> Is this how it works? Do executors recover / re-register the shuffle files 
> that they find?
> So far I have not come across any recovery mechanism. I can only see 
> KubernetesLocalDiskShuffleDataIO, which has a pre-init step where it tries 
> to register the available shuffle files to itself.
> A natural follow-up on this:
> If what they claim is true, then we should expect that when an 
> executor is killed/OOM'd and a new executor is spawned on the same host, the 
> new executor registers the shuffle files to itself. Is that so?
> Thanks
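
For readers following the question, the "pre-init step" described above amounts to scanning a local directory for leftover shuffle output before the executor starts serving blocks. The following is a minimal Python sketch of that discovery step only. It is not Spark's implementation: the function name find_recoverable_shuffle_files is hypothetical, and the sketch assumes Spark's shuffle file naming (shuffle_<shuffleId>_<mapId>_<reduceId>.data and .index) with files nested somewhere under the directory that spark.local.dir points to.

```python
import re
from pathlib import Path
from typing import Dict, List, Set, Tuple

# Assumed naming convention: shuffle_<shuffleId>_<mapId>_<reduceId>.data / .index
SHUFFLE_FILE = re.compile(r"^shuffle_(\d+)_(\d+)_(\d+)\.(data|index)$")


def find_recoverable_shuffle_files(local_dir: str) -> List[Tuple[int, int]]:
    """Return sorted (shuffle_id, map_id) pairs that have BOTH a .data and
    an .index file somewhere under local_dir.

    Hypothetical helper for illustration; it only discovers files and does
    not register anything with a running Spark application.
    """
    seen: Dict[Tuple[int, int], Set[str]] = {}
    for path in Path(local_dir).rglob("shuffle_*"):
        if not path.is_file():
            continue
        match = SHUFFLE_FILE.match(path.name)
        if match is None:
            continue  # not a shuffle data/index file
        shuffle_id, map_id, _reduce_id, ext = match.groups()
        seen.setdefault((int(shuffle_id), int(map_id)), set()).add(ext)
    # A map output is only usable if both halves survived the migration.
    return sorted(key for key, exts in seen.items() if exts == {"data", "index"})
```

Finding the files is only half of recovery: a new executor would still have to report them to the driver before any reducer could fetch them, which is the part the question is asking about.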



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

