wangshengjie created SPARK-32553:
------------------------------------

             Summary: Spark application failed due to stage fetch failed without retry
                 Key: SPARK-32553
                 URL: https://issues.apache.org/jira/browse/SPARK-32553
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.0.0, 2.3.4
            Reporter: wangshengjie

We got an exception when running a Spark application under Spark 2.3.4 and Spark 3.0 with the conf *spark.shuffle.useOldFetchProtocol=true*: the application failed because of a stage fetch failure, and the stage was not retried.

The code looks like the following:

{code:java}
import org.apache.spark.{SparkConf, SparkContext}

// the input path is the only program argument
val Array(input) = args
val sparkConf = new SparkConf().setAppName("Spark Fetch Failed Test")
// for running directly in IDE
sparkConf.setIfMissing("spark.master", "local[2]")
val sc = new SparkContext(sparkConf)
// read the input, then repartition twice with a trim in between
val lines = sc.textFile(input)
  .repartition(1)
  .map(data => data.trim)
  .repartition(1)
// count identical lines and bring the result back to the driver
val doc = lines.map(data => (data, 1)).reduceByKey(_ + _).collect()
{code}

The application DAG looks like the following:

!https://i.stack.imgur.com/0TfZW.png!

If stage 3 fails with a fetch failure, Spark does not retry stage 2 and stage 3 and fails the job instead, because it considers stage 2 and stage 3 non-retryable: the RDDs in stage 2 and stage 3 are *INDETERMINATE*.

Actually, if the shuffle output of stage 1 still exists completely, stage 2 and stage 3 are retryable, because the RDDs in them are not order-sensitive.

If we allow stage 2 and stage 3 to retry, we run into trouble handling *DAGScheduler.getMissingParentStages*. I am also not sure whether *DAGScheduler.getMissingParentStages* breaks the rule that an *INDETERMINATE* RDD is non-retryable.

I would appreciate it if someone would reply.
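
To make the *INDETERMINATE* reasoning above easier to follow, here is a minimal sketch in plain Scala (no Spark dependency) of how a determinism level could propagate through the lineage of the job above. It is a toy model under stated assumptions, not Spark's scheduler code: the names `DeterminismSketch`, `Level`, `Node`, `level` and `repartition` are hypothetical, and the rules assumed are that a shuffle read does not preserve order and that an order-sensitive step over unordered input may produce different data when recomputed.

```scala
// Toy model of determinism-level propagation through a lineage.
// NOT Spark's implementation; every name here is hypothetical.
object DeterminismSketch {
  sealed trait Level
  case object Determinate extends Level   // same data in the same order on recompute
  case object Unordered extends Level     // same data, but the order may differ
  case object Indeterminate extends Level // the data itself may differ

  sealed trait Node
  // A stable input such as a file read.
  case object Source extends Node
  // A shuffle read: assumed not to preserve the order of fetched records.
  final case class Shuffle(parent: Node) extends Node
  // A narrow, map-like step that may or may not depend on input order.
  final case class Narrow(parent: Node, orderSensitive: Boolean) extends Node

  def level(node: Node): Level = node match {
    case Source => Determinate
    case Shuffle(p) =>
      if (level(p) == Indeterminate) Indeterminate else Unordered
    case Narrow(p, sensitive) =>
      level(p) match {
        // An order-sensitive step over unordered input can change its output.
        case Unordered if sensitive => Indeterminate
        case other                  => other
      }
  }

  // Assumption: repartition is an order-sensitive distribution step
  // followed by a shuffle.
  def repartition(parent: Node): Node =
    Shuffle(Narrow(parent, orderSensitive = true))

  def main(args: Array[String]): Unit = {
    val firstRepartition  = repartition(Source)
    val trimmed           = Narrow(firstRepartition, orderSensitive = false)
    val secondRepartition = repartition(trimmed)
    // reduceByKey modeled as an order-insensitive step plus a shuffle.
    val counted = Shuffle(Narrow(secondRepartition, orderSensitive = false))

    println(level(firstRepartition))  // Unordered
    println(level(secondRepartition)) // Indeterminate
    println(level(counted))           // Indeterminate
  }
}
```

Under these assumptions, everything downstream of the second `repartition` is labeled indeterminate. That matches the classification described above, but the sketch says nothing about whether those stages could safely be retried when the shuffle output of stage 1 is still fully available.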