[ https://issues.apache.org/jira/browse/SPARK-26689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750125#comment-16750125 ]
Thomas Graves commented on SPARK-26689:
---------------------------------------

Can you add more details about your setup? Which resource manager were you running?

> Bad disk causing broadcast failure
> ----------------------------------
>
>                 Key: SPARK-26689
>                 URL: https://issues.apache.org/jira/browse/SPARK-26689
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.4.0
>         Environment: Spark on YARN, multiple disks
>            Reporter: liupengcheng
>            Priority: Major
>
> We encountered an application failure in our production cluster that was caused by a bad disk:
> {code:java}
> Job aborted due to stage failure: Task serialization failed: java.io.IOException: Failed to create local dir in /home/work/hdd5/yarn/c3prc-hadoop/nodemanager/usercache/h_user_profile/appcache/application_1463372393999_144979/blockmgr-1f96b724-3e16-4c09-8601-1a2e3b758185/3b.
> org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:73)
> org.apache.spark.storage.DiskStore.contains(DiskStore.scala:173)
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:391)
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:629)
> org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:987)
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
> org.apache.spark.SparkContext.broadcast(SparkContext.scala:1332)
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:863)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1090)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
> scala.Option.foreach(Option.scala:236)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1086)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1085)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1085)
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1528)
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1493)
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1482)
> org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> Our cluster nodes have multiple disks, yet the application still fails. I think this is because Spark does not currently handle bad disks in `DiskBlockManager`. In a multiple-disk environment, Spark could skip a bad disk and fall back to a healthy one instead of failing the whole application.
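For illustration only, here is a minimal standalone Scala sketch of the kind of fallback the reporter proposes. This is not Spark's actual `DiskBlockManager`; the class name `FallbackDiskBlockManager` and its layout are hypothetical. It hashes the block name to pick a local directory, as Spark does, but rotates through the remaining directories when the chosen one cannot be created:

{code:scala}
import java.io.{File, IOException}

// Hypothetical sketch: map a block file name to a subdirectory under one of
// several local dirs, skipping dirs whose subdirectory cannot be created
// (e.g. because the underlying disk is bad).
class FallbackDiskBlockManager(localDirs: Array[File], subDirsPerLocalDir: Int = 64) {

  def getFile(filename: String): File = {
    val hash = filename.hashCode & Int.MaxValue  // non-negative hash
    val n = localDirs.length
    val subDirId = (hash / n) % subDirsPerLocalDir

    // Try the hashed directory first, then rotate through the remaining
    // dirs so a single bad disk does not fail the whole put.
    (0 until n).iterator
      .map(i => localDirs((hash % n + i) % n))
      .map(dir => new File(dir, "%02x".format(subDirId)))
      .find(subDir => subDir.isDirectory || subDir.mkdirs())
      .map(subDir => new File(subDir, filename))
      .getOrElse(throw new IOException(
        s"Failed to create local dir for $filename in any of $n local dirs"))
  }
}
{code}

A real fix inside Spark would need more than this sketch: it would have to remember which directories are bad and keep `getFile` deterministic across calls, since a block written under a fallback directory must later be found again by the same name.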