AngersZhuuuu opened a new pull request #26643: [SPARK-29998][CORE] Retry getFile() until all folder failed then exit
URL: https://github.com/apache/spark/pull/26643

### What changes were proposed in this pull request?
If one of a NodeManager's disks is broken, then when a task starts to run it fetches the jobConf via broadcast; the executor's BlockManager fails to create a local file and throws an IOException:
```
19/11/22 15:14:36 INFO org.apache.spark.scheduler.DAGScheduler: ShuffleMapStage 342 (run at AccessController.java:0) failed in 0.400 s due to Job aborted due to stage failure: Task 21 in stage 343.0 failed 4 times, most recent failure: Lost task 21.3 in stage 343.0 (TID 34968, hostname, executor 104): java.io.IOException: Failed to create local dir in /disk11/yarn/local/usercache/username/appcache/application_1573542949548_2889852/blockmgr-a70777d8-5159-48e7-a47e-848df01a831e/3b.
	at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
	at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:129)
	at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:605)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:214)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
	at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:228)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```
In `TaskSetManager.handleFailedTask()`, a task that fails for this reason is retried on the same executor until `failedTime > maxTaskFailTime`; at that point the stage fails and the whole job is aborted. In this PR I want to make `getFile()` try all local folders, and exit the executor only if every folder is broken; a sketch of the idea is shown below.
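The following is a minimal, self-contained Scala sketch of that fallback loop, not the actual patch: the class `FallbackFileResolver` and its parameters `localDirs` and `subDirsPerLocalDir` are hypothetical stand-ins for `DiskBlockManager`'s internals, and the real change would hook into Spark's executor shutdown path rather than just throwing.
```scala
import java.io.{File, IOException}

// Illustrative sketch only; names below are hypothetical, not Spark's real internals.
class FallbackFileResolver(localDirs: Array[File], subDirsPerLocalDir: Int) {

  /** Try the directory the filename hashes to first, then fall back to the rest. */
  def getFile(filename: String): File = {
    val hash = filename.hashCode & Integer.MAX_VALUE // non-negative hash
    val preferred = hash % localDirs.length
    // Visit the preferred dir first, then the remaining dirs in order.
    val candidates = Iterator.single(preferred) ++
      localDirs.indices.iterator.filter(_ != preferred)
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
    for (dirId <- candidates) {
      val subDir = new File(localDirs(dirId), "%02x".format(subDirId))
      // A dir on a broken disk typically fails mkdirs(); skip it and try the next one.
      if (subDir.isDirectory || subDir.mkdirs()) {
        return new File(subDir, filename)
      }
    }
    // Every local dir is broken: nothing useful is left for this executor to do,
    // so surface a fatal error (the real change would exit the executor here).
    throw new IOException(
      s"Failed to create local file $filename in all ${localDirs.length} local dirs")
  }
}
```
Exiting the executor when every folder has failed (rather than just throwing) matters because otherwise the scheduler keeps rescheduling the task on the same dead node until the retry limit aborts the job.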
### Why are the changes needed?
This problem makes the whole job fail; retrying on the remaining local folders fixes it.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
WIP
