GitHub user markhamstra commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212395101
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1876,6 +1920,22 @@ abstract class RDD[T: ClassTag](
*/
object RDD {
+ /**
+ * The random level of RDD's computing function, which indicates the behavior when rerun the
+ * computing function. There are 3 random levels, ordered by the randomness from low to high:
+ * 1. IDEMPOTENT: The computing function always return the same result with same order when rerun.
+ * 2. UNORDERED: The computing function returns same data set in potentially a different order
+ * when rerun.
+ * 3. INDETERMINATE. The computing function may return totally different result when rerun.
+ *
+ * Note that, the output of the computing function usually relies on parent RDDs. When a
+ * parent RDD's computing function is random, it's very likely this computing function is also
+ * random.
+ */
+ object RandomLevel extends Enumeration {
--- End diff --
I'm not completely wedded to the IDEMPOTENT, UNORDERED, INDETERMINATE
naming, so if somebody has something better or less likely to lead to
confusion, I'm fine with that.
I'd like to not use "random" in these names, though, since that implies
actual randomness at some level, entropy guarantees, etc. What is key is not
whether output values or ordering are truly random, but simply that we can't
easily determine what they are, or whether they are fixed and repeatable.
That's why I'd suggest that things like `RDD.RandomLevel.INDETERMINATE` become
`RDD.Determinism.INDETERMINATE`, and that `computingRandomLevel` become
`computeDeterminism` (unless we want the slightly cheeky `determineDeterminism`
:) ).
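
To make the suggested naming concrete, here is a minimal sketch of how it might
look. This is only an illustration, not the code in this PR: the wrapper object
`DeterminismSketch` is hypothetical, and the rule that an output is as
indeterminate as its least determinate parent is my own simplifying assumption,
loosely based on the doc comment's note about parent RDDs.

```scala
// Hypothetical wrapper so the sketch compiles on its own; in the PR the
// enumeration would live inside `object RDD`.
object DeterminismSketch {

  // Suggested name in place of RandomLevel. Declaration order defines the
  // ordering, from most determinate to least determinate.
  object Determinism extends Enumeration {
    val IDEMPOTENT, UNORDERED, INDETERMINATE = Value
  }

  // Suggested name in place of computingRandomLevel.
  // Assumption (not from the PR): the result is the least determinate level
  // found among the parents, and IDEMPOTENT when there are no parents.
  def computeDeterminism(parents: Seq[Determinism.Value]): Determinism.Value =
    parents.foldLeft(Determinism.IDEMPOTENT) { (worst, level) =>
      if (level >= worst) level else worst
    }
}
```

Nothing here claims anything about entropy; the names only say how much we can
rely on a rerun producing the same output.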
---