[jira] [Created] (SPARK-32553) Spark application failed due to stage fetch failure without retry

wangshengjie (Jira) Wed, 05 Aug 2020 18:39:07 -0700

wangshengjie created SPARK-32553:
------------------------------------

             Summary: Spark application failed due to stage fetch failure without retry
                 Key: SPARK-32553
                 URL: https://issues.apache.org/jira/browse/SPARK-32553
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.0.0, 2.3.4
            Reporter: wangshengjie



We got an exception when running a Spark application under Spark 2.3.4 and Spark 
3.0 with the conf *spark.shuffle.useOldFetchProtocol=true*: the application 
failed because a stage hit a fetch failure and the stage was not retried.
The code is as follows:
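{code:java}
import org.apache.spark.{SparkConf, SparkContext}

// The input path is the only command-line argument
val Array(input) = args

val sparkConf = new SparkConf().setAppName("Spark Fetch Failed Test")
// for running directly in IDE
sparkConf.setIfMissing("spark.master", "local[2]")
val sc = new SparkContext(sparkConf)

// Two repartition calls, so the lineage contains two extra shuffles
val lines = sc.textFile(input)
  .repartition(1)
  .map(data => data.trim)
  .repartition(1)

// reduceByKey adds a third shuffle before the result is collected
val doc = lines.map(data => (data, 1)).reduceByKey(_ + _).collect()
{code}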
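For reference, the flag mentioned above can also be set directly on the SparkConf instead of on the command line. This is only a minimal sketch, assuming Spark 3.0 is on the classpath; the application name is a placeholder.

```scala
import org.apache.spark.SparkConf

// Assumption: Spark 3.0 on the classpath; the app name is only a placeholder.
val conf = new SparkConf()
  .setAppName("Spark Fetch Failed Test")
  // Same setting as passing --conf spark.shuffle.useOldFetchProtocol=true
  .set("spark.shuffle.useOldFetchProtocol", "true")
```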
The application DAG is as follows:
 
!https://i.stack.imgur.com/0TfZW.png!
If stage 3 fails due to a fetch failure, the application does not retry stage 2 
and stage 3, and the job fails, because Spark considers stage 2 and stage 3 
non-retryable: the RDDs in stage 2 and stage 3 are *INDETERMINATE*.
 
Actually, if the shuffle output of stage 1 still exists completely, stage 2 and 
stage 3 are retryable, because the RDDs in them are not order-sensitive. However, 
if we allow stage 2 and stage 3 to retry, we run into trouble handling 
*DAGScheduler.getMissingParentStages*. I am also not sure whether 
*DAGScheduler.getMissingParentStages* breaks the rule that an *INDETERMINATE* RDD 
is non-retryable.
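
To illustrate the classification being discussed, here is a toy model of how a determinism level might propagate along the lineage of the job above. It is a simplification based on my understanding, not the DAGScheduler or RDD code. All type and function names (Level, Step, next, lineage) are hypothetical, and the propagation rules are an assumption.

```scala
// Toy model only: every name below is hypothetical and is not part of Spark's API.
object DeterminismSketch {
  sealed trait Level
  case object Determinate extends Level    // same rows in the same order on every re-run
  case object Unordered extends Level      // same rows, but the order may differ
  case object Indeterminate extends Level  // the rows themselves may differ

  sealed trait Step
  case object Narrow extends Step               // e.g. map(_.trim): keeps the parent's level
  case object OrderSensitiveNarrow extends Step // e.g. round-robin distribution in repartition
  case object Shuffle extends Step              // shuffle read: block arrival order is random

  // Assumed propagation rules for one step of the lineage.
  def next(parent: Level, step: Step): Level = (parent, step) match {
    case (Indeterminate, _)                => Indeterminate // never recovers
    case (Unordered, OrderSensitiveNarrow) => Indeterminate // order change alters the output
    case (Determinate, Shuffle)            => Unordered     // only the order is lost
    case (level, _)                        => level
  }

  // Lineage of the reproduction: two repartition calls followed by reduceByKey.
  val lineage: Seq[(String, Step)] = Seq(
    "repartition #1: distribute"   -> OrderSensitiveNarrow,
    "repartition #1: shuffle read" -> Shuffle,
    "map(_.trim)"                  -> Narrow,
    "repartition #2: distribute"   -> OrderSensitiveNarrow,
    "repartition #2: shuffle read" -> Shuffle,
    "map((_, 1))"                  -> Narrow,
    "reduceByKey: shuffle read"    -> Shuffle
  )

  // levels(0) is the input (assumed determinate); levels(i + 1) follows lineage(i).
  val levels: Seq[Level] =
    lineage.scanLeft(Determinate: Level) { case (lvl, (_, step)) => next(lvl, step) }

  def main(args: Array[String]): Unit =
    lineage.zip(levels.tail).foreach { case ((name, _), lvl) => println(s"$name -> $lvl") }
}
```

In this model the second repartition is the point where the level becomes indeterminate, because its order-sensitive distribution runs over shuffled (unordered) input. Everything downstream inherits that level.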
 
I would appreciate it if someone would reply.
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

