Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212205748
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1876,6 +1920,22 @@ abstract class RDD[T: ClassTag](
*/
object RDD {
+ /**
+ * The random level of RDD's computing function, which indicates the behavior when rerun the
+ * computing function. There are 3 random levels, ordered by the randomness from low to high:
+ * 1. IDEMPOTENT: The computing function always return the same result with same order when rerun.
+ * 2. UNORDERED: The computing function returns same data set in potentially a different order
+ * when rerun.
+ * 3. INDETERMINATE. The computing function may return totally different result when rerun.
+ *
+ * Note that, the output of the computing function usually relies on parent RDDs. When a
+ * parent RDD's computing function is random, it's very likely this computing function is also
+ * random.
+ */
+ object RandomLevel extends Enumeration {
--- End diff --
You are right that this is unclear. What Spark cares about is the output
of an RDD partition (what `RDD#compute` returns) when rerun. The RDD may be a
root RDD that doesn't have a closure, a mapped RDD, or something else,
but that doesn't matter.
When Spark executes a chain of RDDs, it only cares about the `RandomLevel`
of the last RDD, and the RDDs are responsible for propagating this information
from the root RDD to the last RDD.
In general, an RDD should have a property that indicates its output behavior
when rerun, and some RDDs can define other methods to help propagate
the `RandomLevel` property (like the `orderSensitiveFunc` flag in MappedRDD).
How about
```
object OutputDifferWhenRerun extends Enumeration {
val EXACTLY_SAME, DIFFERENT_ORDER, TOTALLY_DIFFERENT = Value
}
```
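To make the propagation idea concrete, here is a minimal, self-contained sketch (not Spark's actual API; `SimpleRDD`, `selfLevel`, `outputLevel`, and the subclasses are all hypothetical names) of how a chain of RDD-like nodes could propagate the determinism level from the root to the last node, with an order-sensitive transformation escalating the level the way the `orderSensitiveFunc` flag suggests:

```scala
// Hypothetical enumeration, ordered from least to most random.
object OutputDifferWhenRerun extends Enumeration {
  val EXACTLY_SAME, DIFFERENT_ORDER, TOTALLY_DIFFERENT = Value
}

// Hypothetical simplified RDD: each node reports the randomness its own
// compute step introduces, and combines it with its parent's level by
// taking the higher (more random) of the two.
abstract class SimpleRDD(parent: Option[SimpleRDD]) {
  // The level this node's own compute function introduces.
  protected def selfLevel: OutputDifferWhenRerun.Value

  // The effective level of this node's output: at least as random as the parent's.
  final def outputLevel: OutputDifferWhenRerun.Value = {
    val parentLevel =
      parent.map(_.outputLevel).getOrElse(OutputDifferWhenRerun.EXACTLY_SAME)
    if (parentLevel > selfLevel) parentLevel else selfLevel
  }
}

// A deterministic source, e.g. reading a stable input.
class DeterministicSource extends SimpleRDD(None) {
  protected def selfLevel = OutputDifferWhenRerun.EXACTLY_SAME
}

// A shuffle-like step: same data set, but potentially a different order on rerun.
class ShuffledLike(parent: SimpleRDD) extends SimpleRDD(Some(parent)) {
  protected def selfLevel = OutputDifferWhenRerun.DIFFERENT_ORDER
}

// An order-sensitive map (think zipWithIndex-style): if its parent's output
// order can change on rerun, its own output becomes totally different.
class OrderSensitiveMap(parent: SimpleRDD) extends SimpleRDD(Some(parent)) {
  protected def selfLevel = {
    if (parent.outputLevel >= OutputDifferWhenRerun.DIFFERENT_ORDER)
      OutputDifferWhenRerun.TOTALLY_DIFFERENT
    else
      OutputDifferWhenRerun.EXACTLY_SAME
  }
}
```

With this shape, only the last RDD's `outputLevel` needs to be consulted at scheduling time; each node locally decides how its parent's level combines with its own.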
---