Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
@cloud-fan We should not look at a particular stage in isolation, but
rather at what happens when there are failures in the middle of a job with
multiple shuffle stages, where zip is one of the internal stages.
A synthetic example:
```scala
rdd1.zip(rdd2)
  .map(v => (computeKey(v._1, v._2), computeValue(v._1, v._2)))
  .groupByKey()
  .map(...)
  .save()
```
If the relative ordering of elements in rdd1 or rdd2 changes on
recomputation, the computed keys change as well - and we end up with data
loss if some of the tasks in the save stage have already completed.
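To make the failure mode concrete, here is a minimal simulation (plain Python, not Spark code; `compute_key`, the element values, and the reordering are all hypothetical) showing how zip's positional pairing produces different keys when one side is recomputed in a different order:

```python
# Hypothetical simulation: zip pairs elements by position, so if a
# recomputed partition yields the same elements in a different order,
# the derived keys change between the first run and the retry.

def compute_key(a, b):
    # Stand-in for the user's computeKey: depends on both zipped elements.
    return (a, b)

rdd1 = [10, 20, 30]
rdd2_first_run = ["a", "b", "c"]   # order observed on the first attempt
rdd2_recomputed = ["b", "a", "c"]  # same elements, different order on retry

keys_first = [compute_key(x, y) for x, y in zip(rdd1, rdd2_first_run)]
keys_retry = [compute_key(x, y) for x, y in zip(rdd1, rdd2_recomputed)]

# keys_first -> [(10, 'a'), (20, 'b'), (30, 'c')]
# keys_retry -> [(10, 'b'), (20, 'a'), (30, 'c')]
# A save task that already wrote output for key (10, 'a') never sees that
# key again on retry, while (10, 'b') is produced twice - records are lost
# or duplicated depending on which output tasks had already completed.
```

The point is that both runs see the same multiset of elements, yet the key sets differ, so partially completed downstream output cannot be reconciled with the retried computation.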