[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

gatorsmile Tue, 28 Aug 2018 09:49:50 -0700

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22112#discussion_r213390708
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1918,3 +1991,19 @@ object RDD {
         new DoubleRDDFunctions(rdd.map(x => num.toDouble(x)))
       }
     }
    +
    +/**
    + * The deterministic level of RDD's output (i.e. what `RDD#compute` returns), which indicates how
    + * the output will diff when Spark reruns the tasks for the RDD. There are 3 deterministic levels,
    + * ordered by the determinism from high to low:
    + * 1. DETERMINATE: The RDD output is always same (including order) when rerun.
    + * 2. UNORDERED: The RDD output is always the same data set but in potentially a different order
    + *               when rerun.
    + * 3. INDETERMINATE. The RDD output can be different (not only order) when rerun.
    + *
    + * Note that, the output of an RDD usually relies on parent RDDs. When a parent RDD's output is
    + * INDETERMINATE, it's very likely this RDD's output is also INDETERMINATE.
    --- End diff --
    
    ```
     * The deterministic level of RDD's output (i.e. what `RDD#compute` returns). This explains how
     * the output will diff when Spark reruns the tasks for the RDD. There are 3 deterministic levels:
     * 1. DETERMINATE: The RDD output is always the same data set in the same order after a rerun.
     * 2. UNORDERED: The RDD output is always the same data set but the order can be different
     *               after a rerun.
     * 3. INDETERMINATE. The RDD output can be different after a rerun.
     *
     * Note that, the output of an RDD usually relies on the parent RDDs. When the parent RDD's output
     * is INDETERMINATE, it's very likely the RDD's output is also INDETERMINATE.
    ```
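
    To make the note about parent RDDs concrete, here is a minimal, self-contained
    sketch of the three levels and one possible propagation rule. The names
    `DeterminismSketch`, `Level` and `childLevel` are hypothetical and are not taken
    from this PR. The rule itself (a child takes the least deterministic level among
    its parents) is an assumption, and it is stricter than the "very likely" wording
    in the doc comment.

    ```scala
    // Hypothetical illustration only; not the code under review.
    object DeterminismSketch {
      // Declared from most to least deterministic, so a larger id means
      // less determinism.
      object Level extends Enumeration {
        val DETERMINATE, UNORDERED, INDETERMINATE = Value
      }

      // Assumed rule: an RDD without parents counts as DETERMINATE; otherwise
      // it inherits the least deterministic level found among its parents.
      def childLevel(parents: Seq[Level.Value]): Level.Value =
        if (parents.isEmpty) Level.DETERMINATE else parents.maxBy(_.id)
    }
    ```

    Under this assumed rule, a child with one DETERMINATE parent and one
    INDETERMINATE parent would itself be treated as INDETERMINATE.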


