GitHub user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22112#discussion_r210963213

    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -853,6 +861,11 @@ abstract class RDD[T: ClassTag](
        * second element in each RDD, etc. Assumes that the two RDDs have the *same number of
        * partitions* and the *same number of elements in each partition* (e.g. one was made through
        * a map on the other).
    +   *
    +   * Note that, `zip` violates the requirement of the RDD computing function. If the order of input
    +   * data changes, `zip` will return different result. Because of this, Spark may return unexpected
    +   * result if there is a shuffle after `zip`, and the shuffle failed and retried. To workaround
    +   * this, users can call `zipPartitions` and sort the input data before zip.
    --- End diff --

    All zip methods are affected by this, not just this one.
    I added a list of other methods I have used, from memory (though unfortunately it is not exhaustive).
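For readers following the thread, the workaround named in the proposed doc comment (call `zipPartitions` and sort the input before zipping) might look roughly like the sketch below. This is only an illustration under assumptions, not code from the pull request: `DeterministicZip` and `sortedZip` are hypothetical names, both element types are assumed to have an `Ordering`, and each partition is assumed to fit in memory. It pairs elements by sorted rank within each partition, which is only appropriate if that is the intended correspondence, and it does not help if a retry places elements in different partitions.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object DeterministicZip {
  // Hypothetical helper, not part of Spark. Zips two RDDs partition by
  // partition after sorting each partition's elements, so the pairing does
  // not depend on the order in which upstream data happened to arrive.
  // Assumes both RDDs have the same number of partitions (zipPartitions
  // requires this) and that each partition fits in memory.
  def sortedZip[A: ClassTag: Ordering, B: ClassTag: Ordering](
      left: RDD[A],
      right: RDD[B]): RDD[(A, B)] =
    left.zipPartitions(right) { (leftIter, rightIter) =>
      // Materialize and sort each side so the element order is fixed.
      val sortedLeft = leftIter.toArray.sorted
      val sortedRight = rightIter.toArray.sorted
      // Iterator.zip stops at the shorter side if the counts differ.
      sortedLeft.iterator.zip(sortedRight.iterator)
    }
}
```

Usage would be along the lines of `DeterministicZip.sortedZip(rddA, rddB)` in place of `rddA.zip(rddB)`. The same ordering concern applies to the other zip-style methods mentioned in the comment, which this sketch does not cover.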