[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

mridulm Fri, 17 Aug 2018 09:50:48 -0700

GitHub user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22112#discussion_r210963213
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -853,6 +861,11 @@ abstract class RDD[T: ClassTag](
        * second element in each RDD, etc. Assumes that the two RDDs have the *same number of
        * partitions* and the *same number of elements in each partition* (e.g. one was made through
        * a map on the other).
    +   *
    +   * Note that, `zip` violates the requirement of the RDD computing function. If the order of input
    +   * data changes, `zip` will return different result. Because of this, Spark may return unexpected
    +   * result if there is a shuffle after `zip`, and the shuffle failed and retried. To workaround
    +   * this, users can call `zipPartitions` and sort the input data before zip.
    --- End diff --
    
    All zip methods are affected by this, not just this one.
    I added a list of other methods I have used, from memory (though unfortunately it is not exhaustive).
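
The workaround described in the quoted Scaladoc (call `zipPartitions` and sort the input before zipping) might look roughly like the sketch below. This is only an illustration, not code from the pull request: the object name `SortedZipSketch`, the helper `sortedZipPartition`, and the sample data are hypothetical, and it assumes Spark is on the classpath. Note that it pairs elements by sorted rank within each partition, which matches the original positional pairing only when both sides sort into the same relative order, and that sorting materializes each partition in memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch: make a zip independent of the order in which
// elements arrive within each partition by sorting both sides first.
object SortedZipSketch {

  // Pure helper (hypothetical name): sort both partition iterators, then pair by rank.
  // toVector buffers the whole partition in memory before sorting.
  def sortedZipPartition[A: Ordering, B: Ordering](
      a: Iterator[A],
      b: Iterator[B]): Iterator[(A, B)] =
    a.toVector.sorted.iterator.zip(b.toVector.sorted.iterator)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sorted-zip-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Both RDDs must have the same number of partitions for zipPartitions.
      val left = sc.parallelize(Seq(3, 1, 2, 6, 5, 4), numSlices = 2)
      // A monotonic map, so sorted rank on the right agrees with the left.
      val right = left.map(_ * 10)

      val zipped = left.zipPartitions(right) { (a, b) =>
        sortedZipPartition(a, b)
      }
      zipped.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Because the pairing depends only on the contents of each partition and not on arrival order, a retried upstream task that yields the same elements in a different order should produce the same pairs. It does not help if the retry changes which elements land in which partition.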



