zhuqi-lucas commented on pull request #28072:
URL: https://github.com/apache/spark/pull/28072#issuecomment-888110325
cc @xuanyuanking @cloud-fan @Ngone51 @tgravescs @dongjoon-hyun
Since this has been reverted, we are hitting disk failures in our production
clusters. How can we handle a failed disk without this change?
There are many disks on each node in our YARN clusters, but when one disk fails
we simply retry the task. Can we avoid retrying on the same failed disk within a node?
Or does Spark have a disk blacklist solution now?
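For what it's worth, the only workaround I know of today is executor/node-level
blacklisting via the `spark.blacklist.*` configs (available since Spark 2.1),
which excludes whole executors or nodes rather than individual disks. A rough
sketch of that configuration:

```scala
// A sketch using executor/node-level blacklisting as a coarse workaround;
// it excludes whole executors/nodes after task failures, not individual disks.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  // Enable blacklisting of executors/nodes after repeated task failures.
  .set("spark.blacklist.enabled", "true")
  // After this many failed attempts of one task on an executor,
  // subsequent attempts are scheduled elsewhere.
  .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")
  // Likewise at the node level, so retries stop landing on a machine
  // with a bad disk.
  .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")

val spark = SparkSession.builder().config(conf).getOrCreate()
```

This is coarse: one bad disk ends up taking the whole node out of scheduling
for the task, which is exactly why a per-disk blacklist would still be useful.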
Also, with this reverted, applications with many tasks that never actually
write shuffle data still create the temp shuffle files, which adds overhead.
If we can find a workaround that avoids creating temp shuffle files when tasks
don't need them, I still think we should handle this.
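To illustrate the lazy-creation idea (a minimal sketch with hypothetical names,
not Spark's actual DiskBlockObjectWriter):

```scala
// A minimal sketch of lazy temp-file creation, with hypothetical names;
// not Spark's actual shuffle writer implementation.
import java.io.{File, FileOutputStream, OutputStream}

class LazyShuffleFileWriter(file: File) {
  private var out: OutputStream = _

  // Open the file only on the first actual write, so tasks that
  // produce no shuffle output never touch the disk.
  private def stream(): OutputStream = {
    if (out == null) {
      out = new FileOutputStream(file)
    }
    out
  }

  def write(bytes: Array[Byte]): Unit = stream().write(bytes)

  // Close only if something was written; otherwise no file exists at all.
  def close(): Unit = if (out != null) out.close()
}
```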
The logs are:
DAGScheduler: ShuffleMapStage 521 (insertInto at Tools.scala:147) failed in 4.995 s due to Job aborted due to stage failure: Task 30 in stage 521.0 failed 4 times, most recent failure: Lost task 30.3 in stage 521.0 (TID 127941, ********** 91): java.io.FileNotFoundException: /data2/yarn/local/usercache/aa/appcache/*****/blockmgr-eb5ca215-a7af-41be-87ee-89fd7e3b1de5/0e/temp_shuffle_45279ef1-5143-4632-9df0-d7ee1f50c026 (Input/output error)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Thanks.