[ https://issues.apache.org/jira/browse/SPARK-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537914#comment-14537914 ]
Dibyendu Bhattacharya commented on SPARK-7525:
----------------------------------------------

This issue does not happen when checkpointing to HDFS. If the checkpoint directory is in Tachyon, the issue shows up when the Receiver fails. For the Driver-failure case, Spark Streaming can recover even when the checkpoint directory is in Tachyon. A minimal sketch of the configuration under test is included after the quoted issue below.

> Could not read data from write ahead log record when Receiver failed and WAL is stored in Tachyon
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-7525
>                 URL: https://issues.apache.org/jira/browse/SPARK-7525
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.4.0
>         Environment: AWS, Spark Streaming 1.4 with Tachyon 0.6.4
>            Reporter: Dibyendu Bhattacharya
>
> I was testing the fault-tolerance behaviour of Spark Streaming with the checkpoint directory stored in Tachyon. Spark Streaming is able to recover from Driver failure, but when the Receiver failed, Spark Streaming was not able to read the WAL files written by the failed Receiver. Below is the exception raised when the Receiver failed.
>
> INFO : org.apache.spark.scheduler.DAGScheduler - Executor lost: 2 (epoch 1)
> INFO : org.apache.spark.storage.BlockManagerMasterEndpoint - Trying to remove executor 2 from BlockManagerMaster.
> INFO : org.apache.spark.storage.BlockManagerMasterEndpoint - Removing block manager BlockManagerId(2, 10.252.5.54, 45789)
> INFO : org.apache.spark.storage.BlockManagerMaster - Removed 2 successfully in removeExecutor
> INFO : org.apache.spark.streaming.scheduler.ReceiverTracker - Registered receiver for stream 2 from 10.252.5.62:47255
> WARN : org.apache.spark.scheduler.TaskSetManager - Lost task 2.1 in stage 103.0 (TID 421, 10.252.5.62): org.apache.spark.SparkException: Could not read data from write ahead log record FileBasedWriteAheadLogSegment(tachyon-ft://10.252.5.113:19998/tachyon/checkpoint/receivedData/2/log-1431341091711-1431341151711,645603894,10891919)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:144)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:168)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:168)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:168)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>     at org.apache.spark.scheduler.Task.run(Task.scala:70)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.IllegalArgumentException: Seek position is past EOF: 645603894, fileSize = 0
>     at tachyon.hadoop.HdfsFileInputStream.seek(HdfsFileInputStream.java:239)
>     at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:37)
>     at org.apache.spark.streaming.util.FileBasedWriteAheadLogRandomReader.read(FileBasedWriteAheadLogRandomReader.scala:37)
>     at org.apache.spark.streaming.util.FileBasedWriteAheadLog.read(FileBasedWriteAheadLog.scala:104)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:141)
>     ... 15 more
> INFO : org.apache.spark.scheduler.TaskSetManager - Starting task 2.2 in stage 103.0 (TID 422, 10.252.5.61, ANY, 1909 bytes)
> INFO : org.apache.spark.scheduler.TaskSetManager - Lost task 2.2 in stage 103.0 (TID 422) on executor 10.252.5.61: org.apache.spark.SparkException (Could not read data from write ahead log record FileBasedWriteAheadLogSegment(tachyon-ft://10.252.5.113:19998/tachyon/checkpoint/receivedData/2/log-1431341091711-1431341151711,645603894,10891919)) [duplicate 1]
> INFO : org.apache.spark.scheduler.TaskSetManager - Starting task 2.3 in stage 103.0 (TID 423, 10.252.5.62, ANY, 1909 bytes)
> INFO : org.apache.spark.deploy.client.AppClient$ClientActor - Executor updated: app-20150511104442-0048/2 is now LOST (worker lost)
> INFO : org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend - Executor app-20150511104442-0048/2 removed: worker lost
> ERROR: org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend - Asked to remove non-existent executor 2
> INFO : org.apache.spark.scheduler.TaskSetManager - Lost task 2.3 in stage 103.0 (TID 423) on executor 10.252.5.62: org.apache.spark.SparkException (Could not read data from write ahead log record FileBasedWriteAheadLogSegment(tachyon-ft://10.252.5.113:19998/tachyon/checkpoint/receivedData/2/log-1431341091711-1431341151711,645603894,10891919)) [duplicate 2]
> ERROR: org.apache.spark.scheduler.TaskSetManager - Task 2 in stage 103.0 failed 4 times; aborting job
> INFO : org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 103.0, whose tasks have all completed, from pool
> INFO : org.apache.spark.scheduler.TaskSchedulerImpl - Cancelling stage 103
> INFO : org.apache.spark.scheduler.DAGScheduler - ResultStage 103 (foreachRDD at Consumer.java:92) failed in 0.943 s
> INFO : org.apache.spark.scheduler.DAGScheduler - Job 120 failed: foreachRDD at Consumer.java:92, took 0.953482 s
> ERROR: org.apache.spark.streaming.scheduler.JobScheduler - Error running job streaming job 1431341145000 ms.0
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 103.0 failed 4 times, most recent failure: Lost task 2.3 in stage 103.0 (TID 423, 10.252.5.62): org.apache.spark.SparkException: Could not read data from write ahead log record FileBasedWriteAheadLogSegment(tachyon-ft://10.252.5.113:19998/tachyon/checkpoint/receivedData/2/log-1431341091711-1431341151711,645603894,10891919)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:144)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:168)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:168)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:168)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>     at org.apache.spark.scheduler.Task.run(Task.scala:70)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.IllegalArgumentException: Seek position is past EOF: 645603894, fileSize = 0
>     at tachyon.hadoop.HdfsFileInputStream.seek(HdfsFileInputStream.java:239)
>     at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:37)
>     at org.apache.spark.streaming.util.FileBasedWriteAheadLogRandomReader.read(FileBasedWriteAheadLogRandomReader.scala:37)
>     at org.apache.spark.streaming.util.FileBasedWriteAheadLog.read(FileBasedWriteAheadLog.scala:104)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:141)
>     ... 15 more
> Driver stacktrace:
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>     at scala.Option.foreach(Option.scala:236)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 103.0 failed 4 times, most recent failure: Lost task 2.3 in stage 103.0 (TID 423, 10.252.5.62): org.apache.spark.SparkException: Could not read data from write ahead log record FileBasedWriteAheadLogSegment(tachyon-ft://10.252.5.113:19998/tachyon/checkpoint/receivedData/2/log-1431341091711-1431341151711,645603894,10891919)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:144)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:168)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:168)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:168)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>     at org.apache.spark.scheduler.Task.run(Task.scala:70)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.IllegalArgumentException: Seek position is past EOF: 645603894, fileSize = 0
>     at tachyon.hadoop.HdfsFileInputStream.seek(HdfsFileInputStream.java:239)
>     at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:37)
>     at org.apache.spark.streaming.util.FileBasedWriteAheadLogRandomReader.read(FileBasedWriteAheadLogRandomReader.scala:37)
>     at org.apache.spark.streaming.util.FileBasedWriteAheadLog.read(FileBasedWriteAheadLog.scala:104)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:141)
>     ... 15 more
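The minimal sketch of the configuration under test, as referenced in the comment above. This is not from the report: the socket source, host, port, batch interval, and object name are placeholders standing in for the reporter's actual receiver and settings. It enables the receiver write ahead log and points the checkpoint directory at Tachyon, which is what produces WAL segments under tachyon-ft://.../checkpoint/receivedData/<streamId>/:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TachyonWalRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("TachyonWalRepro")
      // Write received blocks to the WAL as well as the BlockManager.
      // Segments land under <checkpoint dir>/receivedData/<streamId>/,
      // the path family seen in the exception above.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(5))

    // Checkpoint directory on Tachyon rather than HDFS (master host and
    // port mirror the paths in the stack trace; adjust for your cluster).
    ssc.checkpoint("tachyon-ft://10.252.5.113:19998/tachyon/checkpoint")

    // Any receiver-based input works; socketTextStream is a placeholder
    // for the receiver used in the report.
    val lines = ssc.socketTextStream("localhost", 9999,
      StorageLevel.MEMORY_AND_DISK_SER)
    lines.foreachRDD(rdd => println("records: " + rdd.count()))

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}

With this running, killing the worker that hosts the receiver forces later tasks to recompute blocks from the WAL. The read path in the stack trace (FileBasedWriteAheadLog.read via FileBasedWriteAheadLogRandomReader.read) seeks to the segment's recorded offset (645603894) and reads its recorded length (10891919 bytes); the cause "Seek position is past EOF: 645603894, fileSize = 0" suggests Tachyon reports the log file as empty once the worker that wrote it is lost, so the seek fails, whereas the same sequence against an HDFS-backed checkpoint directory finds the flushed data.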