[GitHub] spark pull request: [SPARK-4879] Use driver to coordinate Hadoop o...

JoshRosen Tue, 10 Feb 2015 14:28:25 -0800

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/4066#issuecomment-73797616
  
    ([Cross-posted from 
JIRA](https://issues.apache.org/jira/browse/SPARK-4879?focusedCommentId=14315077&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14315077))
    
    This issue is really hard to reproduce, but I managed to trigger the 
original bug as part of the testing for my patch. Here's what I ran:
    
    ```bash
    ~/spark-1.3.0-SNAPSHOT-bin-1.0.4/bin/spark-shell --conf 
spark.speculation.multiplier=1 --conf spark.speculation.quantile=0.01 --conf 
spark.speculation=true --conf  
spark.hadoop.outputCommitCoordination.enabled=false
    ```
    
    ```scala
    val numTasks = 100
    val numTrials = 100
    val outputPath = "/output-committer-bug-"
    val sleepDuration = 1000
    
    for (trial <- 0 to (numTrials - 1)) {
      val outputLocation = outputPath + trial
      sc.parallelize(1 to numTasks, numTasks).mapPartitionsWithContext { case 
(ctx, iter) =>
        if (ctx.partitionId % 5 == 0) {
          if (ctx.attemptNumber == 0) {  // If this is the first attempt, run 
slow
           Thread.sleep(sleepDuration)
          }
        }
        iter
      }.map(identity).saveAsTextFile(outputLocation)
      Thread.sleep(sleepDuration * 2)
      println("TESTING OUTPUT OF TRIAL " + trial)
      val savedData = sc.textFile(outputLocation).map(_.toInt).collect()
      if (savedData.toSet != (1 to numTasks).toSet) {
        println("MISSING: " + ((1 to numTasks).toSet -- savedData.toSet))
        assert(false)
      }
      println("-" * 80)
    }
    ```
    
    It took 22 runs until I actually observed missing output partitions 
(several of the earlier runs threw spurious exceptions and didn't have missing 
outputs):
    
    ```
    [...]
    15/02/10 22:17:21 INFO scheduler.DAGScheduler: Job 66 finished: 
saveAsTextFile at <console>:39, took 2.479592 s
    15/02/10 22:17:21 WARN scheduler.TaskSetManager: Lost task 75.0 in stage 
66.0 (TID 6861, ip-172-31-1-124.us-west-2.compute.internal): 
java.io.IOException: Failed to save output of task: 
attempt_201502102217_0066_m_000075_6861
            at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
            at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
            at 
org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
            at 
org.apache.spark.SparkHadoopWriter.performCommit$1(SparkHadoopWriter.scala:113)
            at 
org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:150)
            at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
            at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
            at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
            at org.apache.spark.scheduler.Task.run(Task.scala:64)
            at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
            at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:745)
    
    15/02/10 22:17:21 WARN scheduler.TaskSetManager: Lost task 80.0 in stage 
66.0 (TID 6866, ip-172-31-11-151.us-west-2.compute.internal): 
java.io.IOException: The temporary job-output directory 
hdfs://ec2-54-213-142-80.us-west-2.compute.amazonaws.com:9000/output-committer-bug-22/_temporary
 doesn't exist!
            at 
org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
            at 
org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
            at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
            at 
org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
            at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
            at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
            at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
            at org.apache.spark.scheduler.Task.run(Task.scala:64)
            at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
            at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:745)
    
    15/02/10 22:17:21 INFO scheduler.TaskSetManager: Lost task 85.0 in stage 
66.0 (TID 6871) on executor ip-172-31-1-124.us-west-2.compute.internal: 
java.io.IOException (The temporary job-output directory 
hdfs://ec2-54-213-142-80.us-west-2.compute.amazonaws.com:9000/output-committer-bug-22/_temporary
 doesn't exist!) [duplicate 1]
    15/02/10 22:17:21 INFO scheduler.TaskSetManager: Lost task 90.0 in stage 
66.0 (TID 6876) on executor ip-172-31-1-124.us-west-2.compute.internal: 
java.io.IOException (The temporary job-output directory 
hdfs://ec2-54-213-142-80.us-west-2.compute.amazonaws.com:9000/output-committer-bug-22/_temporary
 doesn't exist!) [duplicate 2]
    15/02/10 22:17:21 INFO scheduler.TaskSetManager: Lost task 95.0 in stage 
66.0 (TID 6881) on executor ip-172-31-11-151.us-west-2.compute.internal: 
java.io.IOException (The temporary job-output directory 
hdfs://ec2-54-213-142-80.us-west-2.compute.amazonaws.com:9000/output-committer-bug-22/_temporary
 doesn't exist!) [duplicate 3]
    15/02/10 22:17:21 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 66.0, 
whose tasks have all completed, from pool
    TESTING OUTPUT OF TRIAL 22
    [...]
    15/02/10 22:17:23 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 67.0, 
whose tasks have all completed, from pool
    15/02/10 22:17:23 INFO scheduler.DAGScheduler: Stage 67 (collect at 
<console>:42) finished in 0.158 s
    15/02/10 22:17:23 INFO scheduler.DAGScheduler: Job 67 finished: collect at 
<console>:42, took 0.165580 s
    MISSING: Set(76)
    java.lang.AssertionError: assertion failed
    [...]
    ```
    
    And I confirmed that it's missing in HDFS:
    
    ```
    ~/ephemeral-hdfs/bin/hadoop fs -ls /output-committer-bug-22 | grep part | 
wc -l
    Warning: $HADOOP_HOME is deprecated.
    
    99
    ```
    
    To test my patch, I'm going to remove the flag that disables it and run 
this for a huge number of trials to ensure that we don't hit the missing output 
bug.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-4879] Use driver to coordinate Hadoop o...

Reply via email to