[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

mridulm Fri, 17 Aug 2018 09:50:48 -0700

Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22112#discussion_r210963665
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1864,6 +1877,22 @@ abstract class RDD[T: ClassTag](
       // From performance concern, cache the value to avoid repeatedly compute 
`isBarrier()` on a long
       // RDD chain.
       @transient protected lazy val isBarrier_ : Boolean = 
dependencies.exists(_.rdd.isBarrier())
    +
    +  /**
    +   * Whether the RDD's computing function is idempotent. Idempotent means 
the computing function
    +   * not only satisfies the requirement, but also produce the same output 
sequence(the output order
    +   * can't vary) given the same input sequence. Spark assumes all the RDDs 
are idempotent, except
    +   * for the shuffle RDD and RDDs derived from non-idempotent RDD.
    +   */
    --- End diff --
    
    This will mean all rdd's which are directly or indirectly reading from an 
unsorted shuffle output are not 'idempotent'.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

Reply via email to