Shivaram Venkataraman created SPARK-2723:
--------------------------------------------
Summary: Block Manager should catch exceptions in putValues
Key: SPARK-2723
URL: https://issues.apache.org/jira/browse/SPARK-2723
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.0.0
Reporter: Shivaram Venkataraman
The BlockManager should catch exceptions encountered while writing files to
disk. Right now these exceptions are counted as user-level task failures, and
the job is aborted after the task fails 4 times (the default
spark.task.maxFailures). We should either fail the executor or handle the
error in some other way that prevents the job from dying.
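
To make the idea concrete, here is a minimal sketch of catching the I/O error at the write site and returning it as a value the caller can act on (for example, by retrying on another directory or failing the executor). This is not Spark's DiskStore code: `SafeDiskWrite`, `DiskWriteResult`, `Written`, and `DiskFailure` are hypothetical names introduced only for illustration.

```scala
import java.io.{File, FileOutputStream, IOException}

// Hypothetical result type, not part of Spark's API.
sealed trait DiskWriteResult
case class Written(bytes: Long) extends DiskWriteResult
case class DiskFailure(file: File, cause: IOException) extends DiskWriteResult

object SafeDiskWrite {
  // Write `data` to `file`. Any IOException (including the
  // FileNotFoundException thrown when the file cannot be opened) is
  // captured and returned instead of propagating up as a task failure.
  def write(file: File, data: Array[Byte]): DiskWriteResult = {
    try {
      val out = new FileOutputStream(file)
      try out.write(data) finally out.close()
      Written(data.length.toLong)
    } catch {
      case e: IOException => DiskFailure(file, e)
    }
  }
}
```

With something like this, the caller can tell a storage-level failure apart from an error in user code and decide what to do, rather than letting the exception count toward the task's failure limit.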
I ran into this when one disk on a large EC2 cluster failed, which caused a
long-running job to terminate. Longer term, should we also look at
blacklisting local directories when one of them becomes unusable?
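
As a rough illustration of the blacklisting idea, the sketch below picks a local directory for a block by hashing over only the directories not yet marked as failed. It is an assumption-laden toy, not Spark's DiskBlockManager: `LocalDirSelector`, `markFailed`, `healthyDirs`, and `dirFor` are hypothetical names.

```scala
import java.io.File
import scala.collection.mutable

// Hypothetical helper, not Spark's DiskBlockManager.
class LocalDirSelector(dirs: Seq[File]) {
  // Directories that have produced an I/O error and should be skipped.
  private val bad = mutable.Set.empty[File]

  // Record that a directory is unusable.
  def markFailed(dir: File): Unit = synchronized { bad += dir }

  // Directories still considered usable, in their configured order.
  def healthyDirs: Seq[File] = synchronized { dirs.filterNot(bad.contains) }

  // Choose a directory by hashing the block name over the healthy
  // directories only; returns None when every directory has failed.
  def dirFor(blockName: String): Option[File] = synchronized {
    val ok = healthyDirs
    if (ok.isEmpty) None
    else Some(ok(Math.floorMod(blockName.hashCode, ok.length)))
  }
}
```

One caveat with this scheme: marking a directory as failed changes the hash mapping, so blocks already written to healthy directories can no longer be located by re-hashing. A real implementation would need to remember where each block was written.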
Exception pasted below:
14/07/29 00:55:39 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: /mnt2/spark/spark-local-20140728175256-e7cb/28/broadcast_264_piece20 (Input/output error)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:79)
at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:66)
at org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:847)
at org.apache.spark.storage.MemoryStore$$anonfun$ensureFreeSpace$4.apply(MemoryStore.scala:267)
at org.apache.spark.storage.MemoryStore$$anonfun$ensureFreeSpace$4.apply(MemoryStore.scala:256)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.storage.MemoryStore.ensureFreeSpace(MemoryStore.scala:256)
at org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:179)
at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:76)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:663)
at org.apache.spark.storage.BlockManager.put(BlockManager.scala:574)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108)