Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21698
  
    @mridulm you provided a good example of the non-determinism of `zip`:
    ```
    rdd1.zip(rdd2)
      .map(v => (computeKey(v._1, v._2), computeValue(v._1, v._2)))
      .groupByKey()
      .map(...)
      .save()
    ```
    If `rdd1` or `rdd2` is unordered, then any result can be treated as a correct result of this query. If `rdd1` and `rdd2` are ordered, we don't have a problem.
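
    To make the ordering dependence concrete, here is a minimal, hypothetical sketch (the RDD contents, names, and partition counts are made up for illustration): `rdd1` comes out of a round-robin shuffle, so the intra-partition order it hands to `zip` is exactly the kind of thing a task retry can change.
    ```
    import org.apache.spark.{SparkConf, SparkContext}

    object ZipOrderSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[4]").setAppName("zip-order-sketch")
        val sc = new SparkContext(conf)

        // rdd1 is produced by a round-robin shuffle, so the order of records
        // *within* each partition is not guaranteed to be stable across reruns.
        val rdd1 = sc.parallelize(1 to 8, numSlices = 4).repartition(2)

        // rdd2 has a stable layout: 2 partitions, 4 records each.
        val rdd2 = sc.parallelize(Seq("a", "b", "c", "d", "e", "f", "g", "h"), numSlices = 2)

        // zip pairs records positionally, so which Int lands next to which
        // String depends entirely on rdd1's intra-partition order. If a task
        // of rdd1 is recomputed and emits records in a different order, zip
        // yields a different pairing, and each pairing is an equally
        // "correct" result when the input is unordered.
        rdd1.zip(rdd2).collect().foreach(println)

        sc.stop()
      }
    }
    ```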
    
    On the other hand, `rdd.repartition` is very clear about what the correct result is. Whether `rdd` is ordered or not, repartition must not add/remove/update the existing records.
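
    That expectation can be written down as an invariant. A minimal sketch, assuming a local `SparkContext` named `sc` is already in scope: whatever placement and order `repartition` chooses, the multiset of records must be unchanged.
    ```
    // Sketch of the "repartition only moves records" expectation,
    // assuming a SparkContext `sc` is already in scope.
    val before = sc.parallelize(1 to 100, numSlices = 4)
    val after  = before.repartition(7)

    // Placement and order may differ, but the records themselves may not:
    // sorting both sides erases the (irrelevant) ordering and placement,
    // leaving only the multiset of records to compare.
    assert(before.collect().sorted.sameElements(after.collect().sorted))
    ```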
    
    Basically, the user builds an RDD DAG and Spark should produce a result that meets the user's expectation. For `zip`, the user's expectation is: the number of output records in each partition is `min(# of records in the corresponding partition of rdd1, # of records in the corresponding partition of rdd2)`. For `map`, the expectation can be very vague: `zip(...).map(...)` can produce any result. For `repartition`, the expectation is clear: do not add/remove/update the existing records.
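
    For a feel of where that `zip` expectation comes from, plain Scala collections behave the same way. This is an analogy on local collections, not Spark's `RDD.zip`, which is stricter about per-partition sizes:
    ```
    // Scala's collection zip truncates to the shorter input, which is the
    // min(#left, #right) expectation described above.
    val left  = Seq(1, 2, 3, 4)
    val right = Seq("a", "b", "c")

    val zipped = left.zip(right)   // Seq((1,"a"), (2,"b"), (3,"c"))
    assert(zipped.length == math.min(left.length, right.length))
    ```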
    
    That's why we should fix `repartition`: it violates the user's expectation.

