Github user markhamstra commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r213061324
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1918,3 +1980,19 @@ object RDD {
new DoubleRDDFunctions(rdd.map(x => num.toDouble(x)))
}
}
+
+/**
+ * The random level of RDD's output (i.e. what `RDD#compute` returns), which indicates how the
+ * output will diff when Spark reruns the tasks for the RDD. There are 3 random levels, ordered
+ * by the randomness from low to high:
--- End diff --
Again, please remove "random" and "randomness". The issue is not
randomness, but rather determinism. For example, the output of `RDD#compute`
could be completely non-random but still dependent on state not contained in
the RDD. That would still make it problematic when Spark recomputes only some
partitions and then aggregates the results.
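
To make that concrete, here is a minimal sketch (not from this PR; the
`IndeterminateOutputSketch` object and its `counter` are hypothetical) of
`compute` output that involves no randomness at all, yet changes when the
same RDD is re-evaluated, because it depends on mutable state outside the
RDD's lineage:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import java.util.concurrent.atomic.AtomicLong

object IndeterminateOutputSketch {
  // Hypothetical external state that is not captured by the RDD's lineage.
  val counter = new AtomicLong(0L)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("indeterminate-sketch"))

    // Nothing random here, but each compute reads (and advances) `counter`,
    // so re-running a task yields different elements than the original run.
    val rdd = sc.parallelize(1 to 4, numSlices = 2)
      .map(x => x + counter.getAndIncrement())

    println(rdd.collect().toSeq) // first evaluation
    println(rdd.collect().toSeq) // same lineage, different output

    sc.stop()
  }
}
```

If only one partition of such an RDD were lost and recomputed, its elements
would no longer be consistent with the partitions computed earlier, so
aggregating across them would silently mix two different versions of the
output, even though nothing random is involved.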
---