Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21698
  
    Sorry for coming in late on this; I first saw it the other day.
    
    Could someone summarize the discussion here: exactly when does this happen, and why? Checkpointing was mentioned as a workaround for the issue; why does that help? It would be good to add those details to the jira anyway.
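    
    To spell out my understanding of the checkpoint workaround (this is my reading, not something confirmed in this thread): the round-robin assignment in repartition depends on the order of its input, so if a retried task recomputes that input in a different order, rows can land in different reduce partitions than they did on the first attempt. Checkpointing before the repartition materializes the data to reliable storage, so a retry re-reads exactly the same records. A minimal sketch, assuming a hypothetical checkpoint directory like /tmp/spark-checkpoints:
    
    ```
    // Sketch only: checkpoint the input to repartition so retries see identical data.
    sc.setCheckpointDir("/tmp/spark-checkpoints")  // hypothetical directory
    
    val input = sc.parallelize(0 to (1000000 - 1), 1).repartition(200).map(identity)
    input.checkpoint()   // mark for reliable checkpointing
    input.count()        // run a job so the checkpoint is actually written
    
    // Downstream stages now read the materialized data, so a fetch failure and
    // stage retry cannot reshuffle rows into different partitions upstream.
    input.repartition(200).distinct().count()
    ```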
    
    My initial reaction is that this is very bad.  A correctness issue caused by how we handle failures is not something we should write off and expect the user to deal with.
    repartition seems to be the most obvious case, and I know lots of people use it (although hopefully many are using the DataFrame API), and we see fetch failures on large jobs all the time, so this seems really serious.
    
    Trying a similar example to the one listed in jira SPARK-23207, but with an RDD, doesn't reproduce this:
    
    ```
    import scala.sys.process._
    
    import org.apache.spark.TaskContext
    val res = sc.parallelize(0 to (1000000 - 1), 1).repartition(200).map { x =>
      x
    }.repartition(200).map { x =>
      // On the first attempt, a couple of tasks kill the JVM they run in (the
      // executor), so its shuffle output is lost and a fetch failure follows.
      if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
        throw new Exception("pkill -f java".!!)
      }
      x
    }
    res.distinct().count()
    ```
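    
    For completeness, here is roughly what the analogous pipeline could look like through the DataFrame API (this variant is mine, not from the jira; spark.range, repartition and udf are standard APIs, and the pkill trick is the same as above):
    
    ```
    import scala.sys.process._
    
    import org.apache.spark.TaskContext
    import org.apache.spark.sql.functions.{col, udf}
    
    // Same idea: a couple of first-attempt tasks kill their executor JVM so the
    // shuffle output disappears and downstream tasks hit a fetch failure.
    val failFirstAttempt = udf { x: Long =>
      if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
        throw new Exception("pkill -f java".!!)
      }
      x
    }
    
    val df = spark.range(0, 1000000, 1, 1)
      .repartition(200)
      .repartition(200)
      .select(failFirstAttempt(col("id")).as("id"))
    
    df.distinct().count()
    ```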


---
