[ https://issues.apache.org/jira/browse/SPARK-43407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-43407.
----------------------------------
Resolution: Invalid
Please ask questions like this on the Spark user mailing list
(https://spark.apache.org/community.html). You are likely to get a better answer
there.
> Can executors recover/reuse shuffle files upon failure?
> -------------------------------------------------------
>
> Key: SPARK-43407
> URL: https://issues.apache.org/jira/browse/SPARK-43407
> Project: Spark
> Issue Type: Question
> Components: Spark Core
> Affects Versions: 3.3.1
> Reporter: Faiz Halde
> Priority: Minor
>
> Hello,
> We've been in touch with a few Spark specialists who suggested a potential
> solution to improve the reliability of our shuffle-heavy jobs.
> Here is what our setup looks like:
> * Spark version: 3.3.1
> * Java version: 1.8
> * We do not use the external shuffle service
> * We use spot instances
> We run Spark jobs on clusters that use Amazon EBS volumes, and
> spark.local.dir is mounted on the EBS volume. One of the offerings of the
> service we use is EBS migration: if a host is about to
> be evicted, a new host is created and the EBS volume is attached to it.
> When Spark assigns a new executor to the newly created instance, it
> can supposedly recover all the shuffle files already persisted on the migrated
> EBS volume.
> Is this how it works? Do executors recover / re-register the shuffle files
> that they find?
> So far I have not come across any recovery mechanism. I can only see
> KubernetesLocalDiskShuffleDataIO,
> which has a pre-init step where it tries to register the available shuffle
> files to itself.
> A natural follow-up on this:
> If what they claim is true, then we should expect that when an
> executor is killed/OOM'd and a new executor is spawned on the same host, the
> new executor registers the shuffle files to itself. Is that so?
> Thanks
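
To make the idea of "finding leftover shuffle files" concrete, here is a minimal sketch that lists the map outputs still present under a local directory. It is not Spark's recovery code and does not answer whether executors re-register such files. It only assumes the sort-shuffle file naming shuffle_<shuffleId>_<mapId>_<reduceId>.data with a matching .index file; the function name find_recoverable_shuffle_outputs is hypothetical.

```python
import re
from pathlib import Path

# Assumed naming convention for sort-shuffle map output files:
#   shuffle_<shuffleId>_<mapId>_<reduceId>.data  plus a matching .index file
SHUFFLE_DATA = re.compile(r"^shuffle_(\d+)_(\d+)_(\d+)\.data$")


def find_recoverable_shuffle_outputs(local_dir):
    """Return sorted (shuffle_id, map_id) pairs that have both a .data and an .index file.

    Hypothetical helper for illustration only; it inspects file names and
    never talks to Spark.
    """
    found = []
    for data_file in Path(local_dir).rglob("shuffle_*.data"):
        match = SHUFFLE_DATA.match(data_file.name)
        if match is None:
            continue  # not a map output data file
        index_file = data_file.with_suffix(".index")
        if index_file.is_file():
            # A data file without its index cannot be served, so skip it otherwise
            found.append((int(match.group(1)), int(match.group(2))))
    return sorted(found)


if __name__ == "__main__":
    import sys

    # Usage: python scan_shuffle.py /path/to/spark.local.dir
    for shuffle_id, map_id in find_recoverable_shuffle_outputs(sys.argv[1]):
        print(f"shuffle {shuffle_id}, map {map_id}")
```

Running this against the directory configured as spark.local.dir on the migrated volume would show which map outputs physically survived, which is a precondition for any recovery mechanism but not evidence that one is active.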