Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/21698
  
    @cloud-fan We should not look at a particular stage in isolation, but 
rather at what happens when there are failures in the middle of a job with 
multiple shuffle stages - and zip is one of the internal stages.
    A synthetic example:
    `rdd1.zip(rdd2).map(v => (computeKey(v._1, v._2), computeValue(v._1, 
v._2))).groupByKey().map().save()`
    
    If the relative ordering of elements in rdd1 or rdd2 changes on 
recomputation, the computed keys change - and we end up with data loss if 
some of the tasks in the final save stage have already completed.
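    To illustrate the failure mode concretely, here is a minimal sketch using 
plain Scala collections as a stand-in for RDD partitions (`computeKey` is a 
hypothetical placeholder, and the reordered sequence simulates a recomputed 
shuffle output with no ordering guarantee):

    ```scala
    object ZipOrderingDemo {
      // Hypothetical key function combining the two zipped values.
      def computeKey(a: Int, b: Int): Int = a * 31 + b

      def main(args: Array[String]): Unit = {
        val rdd1 = Seq(1, 2, 3)
        val rdd2 = Seq(10, 20, 30)
        // Keys computed on the first (successful) run.
        val keysOriginal = rdd1.zip(rdd2).map { case (a, b) => computeKey(a, b) }

        // After a fetch failure, a recomputed upstream stage may return the
        // same elements in a different order - zip now pairs them differently.
        val rdd2Recomputed = Seq(30, 10, 20)
        val keysRecomputed =
          rdd1.zip(rdd2Recomputed).map { case (a, b) => computeKey(a, b) }

        // The key sets no longer match, so records land in different
        // groupByKey groups than the ones already written by completed tasks.
        println(keysOriginal == keysRecomputed) // prints false
      }
    }
    ```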
