[ 
https://issues.apache.org/jira/browse/SPARK-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030480#comment-14030480
 ] 

Wenchen Fan commented on SPARK-2133:
------------------------------------

The exception happened when spark want to write a shuffle file. According to 
the shuffle file name, I guess you have not enabled the shuffle files 
consolidation so spark is going to create a new file at that time. The document 
of java.io.FileOutputStream.<init> says it will throw FileNotFoundException  if 
the file exists but is a directory rather than a regular file, does not exist 
but cannot be created, or cannot be opened for any other reason. So I think the 
reason for your case is "file does not exist but cannot be created". Do you 
make sure the disk mounted to your spark data dir can work well? Can your file 
system handle a lot of files? Could you enable shuffle file consolidation and 
try again?

> FileNotFoundException in BlockObjectWriter
> ------------------------------------------
>
>                 Key: SPARK-2133
>                 URL: https://issues.apache.org/jira/browse/SPARK-2133
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0
>         Environment: YARN
>            Reporter: Neville Li
>
> Seeing a lot of this when running ALS on large data sets (> 50GB) and YARN. 
> The job eventually fails after spark.task.maxFailures has been reached.
> {code}
> Exception in thread "Thread-3" java.lang.reflect.InvocationTargetException
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1.0:677 failed 10 times, most recent failure: Exception failure in TID 
> 946 on host lon4-hadoopslave-b501.lon4.spotify.net: 
> java.io.FileNotFoundException: 
> /disk/hd01/yarn/local/usercache/neville/appcache/application_1401944843353_36952/spark-local-20140611033053-6b18/2a/shuffle_0_677_985
>  (No such file or directory)
>         java.io.FileOutputStream.openAppend(Native Method)
>         java.io.FileOutputStream.<init>(FileOutputStream.java:192)
>         
> org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116)
>         
> org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177)
>         
> org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
>         
> org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
>         scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>         
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>         org.apache.spark.scheduler.Task.run(Task.scala:51)
>         org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>         
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         java.lang.Thread.run(Thread.java:662)
> Driver stacktrace:
>       at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
>       at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
>       at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
>       at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>       at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>       at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
>       at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
>       at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
>       at scala.Option.foreach(Option.scala:236)
>       at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
>       at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>       at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to