[ https://issues.apache.org/jira/browse/SPARK-26907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-26907.
----------------------------------
    Resolution: Invalid

> Does ShuffledRDD Replication Work With External Shuffle Service
> ---------------------------------------------------------------
>
>                 Key: SPARK-26907
>                 URL: https://issues.apache.org/jira/browse/SPARK-26907
>             Project: Spark
>          Issue Type: Question
>          Components: Block Manager, YARN
>    Affects Versions: 2.3.2
>            Reporter: Han Altae-Tran
>            Priority: Major
>
> I am interested in working with highly replicated environments for extreme
> fault tolerance (e.g. 10x replication), but I have noticed that when using
> groupBy or groupWith followed by persist (with 10x replication), the entire
> stage can fail with FetchFailedException even if only a single node fails.
>
> Is this because the External Shuffle Service writes and serves intermediate
> shuffle data only to/from the local disk attached to the executor that
> generated it, causing Spark to ignore possibly replicated shuffle data (from
> the persist) that may be served elsewhere? If so, is there any way to
> increase the replication factor of the External Shuffle Service to make it
> fault tolerant?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org