Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/4066#issuecomment-73797616
([Cross-posted from
JIRA](https://issues.apache.org/jira/browse/SPARK-4879?focusedCommentId=14315077&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14315077))
This issue is really hard to reproduce, but I managed to trigger the
original bug as part of the testing for my patch. Here's what I ran:
```bash
~/spark-1.3.0-SNAPSHOT-bin-1.0.4/bin/spark-shell --conf
spark.speculation.multiplier=1 --conf spark.speculation.quantile=0.01 --conf
spark.speculation=true --conf
spark.hadoop.outputCommitCoordination.enabled=false
```
```scala
val numTasks = 100
val numTrials = 100
val outputPath = "/output-committer-bug-"
val sleepDuration = 1000
for (trial <- 0 to (numTrials - 1)) {
val outputLocation = outputPath + trial
sc.parallelize(1 to numTasks, numTasks).mapPartitionsWithContext { case
(ctx, iter) =>
if (ctx.partitionId % 5 == 0) {
if (ctx.attemptNumber == 0) { // If this is the first attempt, run
slow
Thread.sleep(sleepDuration)
}
}
iter
}.map(identity).saveAsTextFile(outputLocation)
Thread.sleep(sleepDuration * 2)
println("TESTING OUTPUT OF TRIAL " + trial)
val savedData = sc.textFile(outputLocation).map(_.toInt).collect()
if (savedData.toSet != (1 to numTasks).toSet) {
println("MISSING: " + ((1 to numTasks).toSet -- savedData.toSet))
assert(false)
}
println("-" * 80)
}
```
It took 22 runs until I actually observed missing output partitions
(several of the earlier runs threw spurious exceptions and didn't have missing
outputs):
```
[...]
15/02/10 22:17:21 INFO scheduler.DAGScheduler: Job 66 finished:
saveAsTextFile at <console>:39, took 2.479592 s
15/02/10 22:17:21 WARN scheduler.TaskSetManager: Lost task 75.0 in stage
66.0 (TID 6861, ip-172-31-1-124.us-west-2.compute.internal):
java.io.IOException: Failed to save output of task:
attempt_201502102217_0066_m_000075_6861
at
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
at
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
at
org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
at
org.apache.spark.SparkHadoopWriter.performCommit$1(SparkHadoopWriter.scala:113)
at
org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:150)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/02/10 22:17:21 WARN scheduler.TaskSetManager: Lost task 80.0 in stage
66.0 (TID 6866, ip-172-31-11-151.us-west-2.compute.internal):
java.io.IOException: The temporary job-output directory
hdfs://ec2-54-213-142-80.us-west-2.compute.amazonaws.com:9000/output-committer-bug-22/_temporary
doesn't exist!
at
org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
at
org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
at
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
at
org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/02/10 22:17:21 INFO scheduler.TaskSetManager: Lost task 85.0 in stage
66.0 (TID 6871) on executor ip-172-31-1-124.us-west-2.compute.internal:
java.io.IOException (The temporary job-output directory
hdfs://ec2-54-213-142-80.us-west-2.compute.amazonaws.com:9000/output-committer-bug-22/_temporary
doesn't exist!) [duplicate 1]
15/02/10 22:17:21 INFO scheduler.TaskSetManager: Lost task 90.0 in stage
66.0 (TID 6876) on executor ip-172-31-1-124.us-west-2.compute.internal:
java.io.IOException (The temporary job-output directory
hdfs://ec2-54-213-142-80.us-west-2.compute.amazonaws.com:9000/output-committer-bug-22/_temporary
doesn't exist!) [duplicate 2]
15/02/10 22:17:21 INFO scheduler.TaskSetManager: Lost task 95.0 in stage
66.0 (TID 6881) on executor ip-172-31-11-151.us-west-2.compute.internal:
java.io.IOException (The temporary job-output directory
hdfs://ec2-54-213-142-80.us-west-2.compute.amazonaws.com:9000/output-committer-bug-22/_temporary
doesn't exist!) [duplicate 3]
15/02/10 22:17:21 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 66.0,
whose tasks have all completed, from pool
TESTING OUTPUT OF TRIAL 22
[...]
15/02/10 22:17:23 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 67.0,
whose tasks have all completed, from pool
15/02/10 22:17:23 INFO scheduler.DAGScheduler: Stage 67 (collect at
<console>:42) finished in 0.158 s
15/02/10 22:17:23 INFO scheduler.DAGScheduler: Job 67 finished: collect at
<console>:42, took 0.165580 s
MISSING: Set(76)
java.lang.AssertionError: assertion failed
[...]
```
And I confirmed that it's missing in HDFS:
```
~/ephemeral-hdfs/bin/hadoop fs -ls /output-committer-bug-22 | grep part |
wc -l
Warning: $HADOOP_HOME is deprecated.
99
```
To test my patch, I'm going to remove the flag that disables it and run
this for a huge number of trials to ensure that we don't hit the missing output
bug.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]