Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212205748
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1876,6 +1920,22 @@ abstract class RDD[T: ClassTag](
*/
object RDD {
+ /**
+ * The random level of RDD's computing function, which indicates the behavior when rerun the
+ * computing function. There are 3 random levels, ordered by the randomness from low to high:
+ * 1. IDEMPOTENT: The computing function always return the same result with same order when rerun.
+ * 2. UNORDERED: The computing function returns same data set in potentially a different order
+ * when rerun.
+ * 3. INDETERMINATE. The computing function may return totally different result when rerun.
+ *
+ * Note that, the output of the computing function usually relies on parent RDDs. When a
+ * parent RDD's computing function is random, it's very likely this computing function is also
+ * random.
+ */
+ object RandomLevel extends Enumeration {
--- End diff --
You are right that this is unclear. What Spark cares about is the output
of an RDD partition (what `RDD#compute` returns) when rerun. The RDD may be a
root RDD that doesn't have a closure, a mapped RDD, or something else,
but that doesn't matter.
When Spark executes a chain of RDDs, it only cares about the `RandomLevel`
of the last RDD, and the RDDs are responsible for propagating this information
from the root RDD to the last RDD.
In general, an RDD should have a property that indicates its output behavior
when rerun, and some RDDs can define other methods to help propagate
the `RandomLevel` property (like the `orderSensitiveFunc` flag in MappedRDD).
How about
```
object OutputDifferWhenRerun extends Enumeration {
val EXACTLY_SAME, DIFFERENT_ORDER, TOTALLY_DIFFERENT = Value
}
```
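To make the propagation idea concrete, here is a minimal, self-contained sketch (not Spark's actual API; `SimpleRDD`, `selfLevel`, `outputLevel`, and the subclasses are all hypothetical names) of how a chain of RDD-like nodes could propagate the determinism level from the root to the last node, with an order-sensitive transformation escalating the level the way the `orderSensitiveFunc` flag suggests:

```scala
// Hypothetical enumeration, ordered from least to most random.
object OutputDifferWhenRerun extends Enumeration {
  val EXACTLY_SAME, DIFFERENT_ORDER, TOTALLY_DIFFERENT = Value
}

// Hypothetical simplified RDD: each node reports the randomness its own
// compute step introduces, and combines it with its parent's level by
// taking the higher (more random) of the two.
abstract class SimpleRDD(parent: Option[SimpleRDD]) {
  // The level this node's own compute function introduces.
  protected def selfLevel: OutputDifferWhenRerun.Value

  // The effective level of this node's output: at least as random as the parent's.
  final def outputLevel: OutputDifferWhenRerun.Value = {
    val parentLevel =
      parent.map(_.outputLevel).getOrElse(OutputDifferWhenRerun.EXACTLY_SAME)
    if (parentLevel > selfLevel) parentLevel else selfLevel
  }
}

// A deterministic source, e.g. reading a stable input.
class DeterministicSource extends SimpleRDD(None) {
  protected def selfLevel = OutputDifferWhenRerun.EXACTLY_SAME
}

// A shuffle-like step: same data set, but potentially a different order on rerun.
class ShuffledLike(parent: SimpleRDD) extends SimpleRDD(Some(parent)) {
  protected def selfLevel = OutputDifferWhenRerun.DIFFERENT_ORDER
}

// An order-sensitive map (think zipWithIndex-style): if its parent's output
// order can change on rerun, its own output becomes totally different.
class OrderSensitiveMap(parent: SimpleRDD) extends SimpleRDD(Some(parent)) {
  protected def selfLevel = {
    if (parent.outputLevel >= OutputDifferWhenRerun.DIFFERENT_ORDER)
      OutputDifferWhenRerun.TOTALLY_DIFFERENT
    else
      OutputDifferWhenRerun.EXACTLY_SAME
  }
}
```

With this shape, only the last RDD's `outputLevel` needs to be consulted at scheduling time; each node locally decides how its parent's level combines with its own.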
---