[ 
https://issues.apache.org/jira/browse/SPARK-32553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangshengjie resolved SPARK-32553.
----------------------------------
    Resolution: Abandoned

We found another solution to the problem, so I am closing this issue. Thanks.

> Spark application failed due to stage fetch failed without retry
> ----------------------------------------------------------------
>
>                 Key: SPARK-32553
>                 URL: https://issues.apache.org/jira/browse/SPARK-32553
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.4, 3.0.0
>            Reporter: wangshengjie
>            Priority: Major
>              Labels: DAGScheduler, repartition
>
> We got an exception when running a Spark application under Spark 2.3.4 and 
> Spark 3.0 with the conf *spark.shuffle.useOldFetchProtocol=true*. The 
> application failed because a stage hit a fetch failure and was not retried.
> The code is as follows:
> {code:java}
> import org.apache.spark.{SparkConf, SparkContext}
>
> val Array(input) = args
> val sparkConf = new SparkConf().setAppName("Spark Fetch Failed Test")
> // for running directly in IDE
> sparkConf.setIfMissing("spark.master", "local[2]")
> val sc = new SparkContext(sparkConf)
> // two repartitions with a map in between, then a reduceByKey shuffle
> val lines = sc.textFile(input)
>   .repartition(1)
>   .map(data => data.trim)
>   .repartition(1)
> val doc = lines.map(data => (data, 1)).reduceByKey(_ + _).collect()
> {code}
> The application DAG looks like this:
>  
> !https://i.stack.imgur.com/0TfZW.png!
> If stage 3 fails due to a fetch failure, the application does not retry 
> stage 2 and stage 3 and fails the job, because Spark considers stage 2 and 
> stage 3 non-retryable: the RDDs in stage 2 and stage 3 are *INDETERMINATE*.
>  
> Actually, if the shuffle output of stage 1 is still fully available, stage 2 
> and stage 3 are retryable, because the RDDs in them are not order-sensitive. 
> If we allow stage 2 and stage 3 to retry, we run into trouble handling 
> *DAGScheduler.getMissingParentStages*. And I am not sure whether 
> *DAGScheduler.getMissingParentStages* breaks the rule that an 
> *INDETERMINATE* RDD is non-retryable.
>  
> I would appreciate it if someone would reply.
>  
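To illustrate why the second repartition can leave the later RDDs marked INDETERMINATE, here is a minimal sketch in plain Scala. It is a toy model and not Spark's implementation: the names DeterminismSketch, Source, Narrow, Shuffle and the three level objects are hypothetical, and the propagation rules are a simplified assumption about how determinism levels flow through the lineage (a shuffle without key ordering yields unordered output, and an order-sensitive step over unordered input becomes indeterminate).

```scala
// Toy model only: these names mirror Spark concepts but are NOT Spark classes.
object DeterminismSketch {
  sealed trait Level
  case object Determinate extends Level   // same rows, same order on recompute
  case object Unordered extends Level     // same rows, order may differ
  case object Indeterminate extends Level // rows per partition may differ

  sealed trait Node { def level: Level }

  // A source that produces identical output on every recomputation.
  case object Source extends Node { val level: Level = Determinate }

  // A narrow transformation. orderSensitive marks functions whose result
  // depends on input row order (assumed here for round-robin distribution).
  final case class Narrow(parent: Node, orderSensitive: Boolean = false) extends Node {
    def level: Level =
      if (orderSensitive && parent.level == Unordered) Indeterminate
      else parent.level
  }

  // A shuffle without key ordering: rows arrive in arbitrary order.
  final case class Shuffle(parent: Node) extends Node {
    def level: Level =
      if (parent.level == Indeterminate) Indeterminate else Unordered
  }

  // repartition modeled as an order-sensitive distribution step plus a shuffle.
  def repartition(parent: Node): Node =
    Shuffle(Narrow(parent, orderSensitive = true))

  def main(args: Array[String]): Unit = {
    val first = repartition(Source)        // textFile(...).repartition(1)
    val trimmed = Narrow(first)            // .map(data => data.trim)
    val second = repartition(trimmed)      // .repartition(1)
    val counted = Shuffle(Narrow(second))  // .map(...).reduceByKey(_ + _)

    println(s"after first repartition:  ${first.level}")
    println(s"after map:                ${trimmed.level}")
    println(s"after second repartition: ${second.level}")
    println(s"after reduceByKey:        ${counted.level}")
  }
}
```

Under these assumptions the first repartition only makes the data unordered, while the second one, being order-sensitive over unordered input, makes everything downstream indeterminate. That is the situation the description above is about.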



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

