GitHub user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22112#discussion_r210963665

    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1864,6 +1877,22 @@ abstract class RDD[T: ClassTag](
       // From performance concern, cache the value to avoid repeatedly compute `isBarrier()` on a long
       // RDD chain.
       @transient protected lazy val isBarrier_ : Boolean = dependencies.exists(_.rdd.isBarrier())
    +
    +  /**
    +   * Whether the RDD's computing function is idempotent. Idempotent means the computing function
    +   * not only satisfies the requirement, but also produce the same output sequence(the output order
    +   * can't vary) given the same input sequence. Spark assumes all the RDDs are idempotent, except
    +   * for the shuffle RDD and RDDs derived from non-idempotent RDD.
    +   */
    --- End diff --

    This will mean that all RDDs which directly or indirectly read from an unsorted shuffle output are not 'idempotent'.
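The transitive effect described in the comment can be illustrated with a toy lineage model. This is only a sketch and does not use Spark's RDD API: `Node`, `Source`, `Mapped`, `Shuffled`, and the `sorted` flag are hypothetical names, and it assumes that a shuffle with sorted output has a fixed output order.

```scala
// Toy lineage model. NOT Spark's RDD API: every name here is hypothetical.
sealed trait Node {
  def parents: Seq[Node]
  // True only if this node's output order is fixed for a fixed input
  // and every ancestor has the same property.
  def isIdempotent: Boolean
}

// A leaf input, assumed to yield the same sequence on every read.
final case class Source(name: String) extends Node {
  val parents: Seq[Node] = Nil
  val isIdempotent: Boolean = true
}

// A narrow transformation: it adds no nondeterminism of its own,
// so it simply inherits the property from its parent.
final case class Mapped(name: String, parent: Node) extends Node {
  val parents: Seq[Node] = Seq(parent)
  lazy val isIdempotent: Boolean = parents.forall(_.isIdempotent)
}

// A shuffle: assumed here to have a fixed output order only when sorted.
final case class Shuffled(name: String, parent: Node, sorted: Boolean) extends Node {
  val parents: Seq[Node] = Seq(parent)
  lazy val isIdempotent: Boolean = sorted && parents.forall(_.isIdempotent)
}

object IdempotenceSketch {
  val input: Node = Source("input")
  val unsorted: Node = Shuffled("repartition", input, sorted = false)
  // Two narrow steps downstream of the unsorted shuffle.
  val derived: Node = Mapped("map", Mapped("filter", unsorted))
  val sortedDerived: Node = Mapped("map", Shuffled("sort", input, sorted = true))

  def main(args: Array[String]): Unit = {
    println(derived.isIdempotent)       // false: inherited from the unsorted shuffle
    println(sortedDerived.isIdempotent) // true under the sorted-output assumption
  }
}
```

In this model a single unsorted shuffle anywhere in the lineage marks every downstream node as non-idempotent, which is the breadth the comment points out.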