AngersZhuuuu opened a new pull request #26643: [SPARK-29998][CORE] Retry getFile() until all folders fail, then exit
URL: https://github.com/apache/spark/pull/26643
 
 
   ### What changes were proposed in this pull request?
   If one of a NodeManager's disks is broken, then when a task starts running and fetches the jobConf via broadcast, the executor's BlockManager fails to create a local file and throws an IOException:
   ```
   19/11/22 15:14:36 INFO org.apache.spark.scheduler.DAGScheduler: "ShuffleMapStage 342 (run at AccessController.java:0) failed in 0.400 s due to Job aborted due to stage failure: Task 21 in stage 343.0 failed 4 times, most recent failure: Lost task 21.3 in stage 343.0 (TID 34968, hostname, executor 104): java.io.IOException: Failed to create local dir in /disk11/yarn/local/usercache/username/appcache/application_1573542949548_2889852/blockmgr-a70777d8-5159-48e7-a47e-848df01a831e/3b.
           at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
           at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:129)
           at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:605)
           at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:214)
           at scala.Option.getOrElse(Option.scala:121)
           at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
           at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
           at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
           at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
           at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
           at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
           at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
           at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144)
           at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:228)
           at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
           at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
           at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
           at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
           at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
           at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
           at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
           at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
           at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
           at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
           at org.apache.spark.scheduler.Task.run(Task.scala:121)
           at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
           at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
           at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
   ```
   In TaskSetManager.handleFailedTask(), a task that fails with this kind of error keeps being retried on the same executor until its failure count exceeds the maximum allowed task failures (`spark.task.maxFailures`). At that point the stage is aborted and the whole job fails.
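
   For illustration, a rough sketch of that abort condition (not Spark's actual TaskSetManager code; `FailureTracker` and its members are made-up names):
   ```scala
   import scala.collection.mutable

   // Every retry bumps a per-task failure counter; once it reaches the configured
   // maximum, the task set is aborted, which fails the stage and the job.
   class FailureTracker(maxTaskFailures: Int) {   // e.g. spark.task.maxFailures = 4
     private val numFailures = mutable.Map.empty[Int, Int].withDefaultValue(0)

     /** Returns true when the task set should be aborted. */
     def handleFailedTask(taskIndex: Int): Boolean = {
       numFailures(taskIndex) += 1
       numFailures(taskIndex) >= maxTaskFailures
     }
   }
   ```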
   
   In this PR, I want to make getFile() try all of the configured local folders, and only exit the executor when every folder is broken.
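
   A minimal, self-contained sketch of the fallback idea (not the actual patch: `FallbackDiskBlockManager` and `createSubDir` are hypothetical names, and `math.abs(filename.hashCode)` merely stands in for Spark's non-negative hashing of the filename):
   ```scala
   import java.io.{File, IOException}

   // Sketch only: probe every configured local dir before giving up, instead of
   // failing as soon as the dir the hash points at is unusable.
   class FallbackDiskBlockManager(localDirs: Array[File], subDirsPerLocalDir: Int) {

     def getFile(filename: String): File = {
       val hash = math.abs(filename.hashCode)                 // stand-in for Spark's hash
       val preferred = hash % localDirs.length
       val subDirId = (hash / localDirs.length) % subDirsPerLocalDir

       // Start from the dir the hash points at, then fall back to the others in order.
       val candidates = localDirs.indices.map(i => (preferred + i) % localDirs.length)
       for (dirId <- candidates) {
         try {
           return new File(createSubDir(localDirs(dirId), subDirId), filename)
         } catch {
           case _: IOException => // this disk is broken, try the next local dir
         }
       }
       // Every local dir failed: surface the error so the executor can exit.
       throw new IOException(
         s"Failed to create local file $filename in any of ${localDirs.length} local dirs")
     }

     // Hypothetical helper: create (or reuse) the numbered sub-directory under a
     // local dir, throwing IOException when it cannot be created.
     private def createSubDir(localDir: File, subDirId: Int): File = {
       val subDir = new File(localDir, "%02x".format(subDirId))
       if (!subDir.exists() && !subDir.mkdirs()) {
         throw new IOException(s"Failed to create local dir in $subDir")
       }
       subDir
     }
   }
   ```
   With such a fallback, the IOException above would only surface when every local dir on the node is broken, which is exactly the case where exiting the executor is the right response.
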
   ### Why are the changes needed?
   A single broken local disk currently makes the whole job fail; retrying across the remaining local directories avoids this.
   
   
   ### Does this PR introduce any user-facing change?
   NO
   
   ### How was this patch tested?
   WIP
