[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

markhamstra Thu, 23 Aug 2018 10:36:35 -0700

Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22112#discussion_r212395101
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1876,6 +1920,22 @@ abstract class RDD[T: ClassTag](
      */
     object RDD {
     
    +  /**
    +   * The random level of RDD's computing function, which indicates the behavior when rerun the
    +   * computing function. There are 3 random levels, ordered by the randomness from low to high:
    +   * 1. IDEMPOTENT: The computing function always return the same result with same order when rerun.
    +   * 2. UNORDERED: The computing function returns same data set in potentially a different order
    +   *               when rerun.
    +   * 3. INDETERMINATE. The computing function may return totally different result when rerun.
    +   *
    +   * Note that, the output of the computing function usually relies on parent RDDs. When a
    +   * parent RDD's computing function is random, it's very likely this computing function is also
    +   * random.
    +   */
    +  object RandomLevel extends Enumeration {
    --- End diff --
    
    I'm not completely wedded to the IDEMPOTENT, UNORDERED, INDETERMINATE
    naming, so if somebody has something better or less likely to lead to
    confusion, I'm fine with that.
    
    I'd like to not use "random" in these names, though, since that implies
    actual randomness at some level, entropy guarantees, etc. What is key is
    not whether output values or ordering are truly random, but simply whether
    we can easily determine what they are and whether they are fixed and
    repeatable. That's why I would suggest that things like
    `RDD.RandomLevel.INDETERMINATE` be `RDD.Determinism.INDETERMINATE`, and
    that `computingRandomLevel` be `computeDeterminism` (unless we want the
    slightly cheeky `determineDeterminism` :) ).
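
    To make the naming suggestion concrete, here is a minimal sketch of how
    the enumeration might read under the `Determinism` name. It is not code
    from the pull request: the wrapper object `DeterminismSketch` and the
    helper `leastDeterministic` are hypothetical, and the helper encodes only
    the simplifying assumption from the doc comment that an RDD is usually no
    more deterministic than its least deterministic parent.

```scala
object DeterminismSketch {

  // Hypothetical rename of RandomLevel, following the suggestion above.
  // Declaration order matters: Enumeration assigns ids 0, 1, 2, so the
  // values compare from most deterministic to least deterministic.
  object Determinism extends Enumeration {
    val IDEMPOTENT, UNORDERED, INDETERMINATE = Value
  }

  // Hypothetical helper (not part of the PR): pick the least deterministic
  // level among the parents, defaulting to IDEMPOTENT when there are none.
  // Enumeration values are Ordered, so `>=` compares them by id.
  def leastDeterministic(parents: Seq[Determinism.Value]): Determinism.Value =
    parents.foldLeft(Determinism.IDEMPOTENT) { (worst, level) =>
      if (worst >= level) worst else level
    }
}
```

    With names like these, a call site reads as a statement about
    repeatability rather than about entropy, e.g.
    `leastDeterministic(Seq(Determinism.IDEMPOTENT, Determinism.UNORDERED))`
    yields `Determinism.UNORDERED`.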


