Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22112#discussion_r212205748

--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---

```
@@ -1876,6 +1920,22 @@ abstract class RDD[T: ClassTag](
    */
 object RDD {
 
+  /**
+   * The random level of RDD's computing function, which indicates the behavior when rerun the
+   * computing function. There are 3 random levels, ordered by the randomness from low to high:
+   * 1. IDEMPOTENT: The computing function always return the same result with same order when rerun.
+   * 2. UNORDERED: The computing function returns same data set in potentially a different order
+   *    when rerun.
+   * 3. INDETERMINATE. The computing function may return totally different result when rerun.
+   *
+   * Note that, the output of the computing function usually relies on parent RDDs. When a
+   * parent RDD's computing function is random, it's very likely this computing function is also
+   * random.
+   */
+  object RandomLevel extends Enumeration {
```

--- End diff --

You are right that this is unclear. What Spark cares about is the output of an RDD partition (what `RDD#compute` returns) when rerun. The RDD may be a root RDD that doesn't have a closure, a mapped RDD, or something else, but that doesn't matter. When Spark executes a chain of RDDs, it only cares about the `RandomLevel` of the last RDD, and the RDDs are responsible for propagating this information from the root RDD to the last RDD.

In general, an RDD should have a property indicating its output behavior when rerun, and some RDDs can define additional methods to help propagate the `RandomLevel` property (like the `orderSensitiveFunc` flag in MappedRDD).

How about

```
object OutputDifferWhenRerun extends Enumeration {
  val EXACTLY_SAME, DIFFERENT_ORDER, TOTALLY_DIFFERENT = Value
}
```
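To make the propagation idea concrete, here is a minimal, self-contained sketch. It is not the actual Spark implementation: the `propagate` helper and its signature are hypothetical, invented only to illustrate how an order-sensitive map function escalates a parent's level, in the spirit of the `orderSensitiveFunc` flag mentioned above.

```scala
// Hypothetical sketch, not Spark's real API.
object RandomLevel extends Enumeration {
  // Ordered from least to most random, as in the scaladoc under review.
  val IDEMPOTENT, UNORDERED, INDETERMINATE = Value
}

// Illustrative propagation rule: a child RDD's level is at least its
// parent's. If the map function is order-sensitive and the parent only
// guarantees the same data set (not the same order), the child's output
// can differ entirely on rerun, so the level escalates to INDETERMINATE.
def propagate(parent: RandomLevel.Value, orderSensitiveFunc: Boolean): RandomLevel.Value =
  if (parent == RandomLevel.UNORDERED && orderSensitiveFunc) RandomLevel.INDETERMINATE
  else parent
```

For example, `zipWithIndex`-style logic over an `UNORDERED` parent would yield `INDETERMINATE`, while a pure element-wise `map` preserves the parent's level unchanged.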