Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22112#discussion_r210964794

--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1864,6 +1877,22 @@ abstract class RDD[T: ClassTag](
   // For performance concerns, cache the value to avoid repeatedly computing `isBarrier()` on a
   // long RDD chain.
   @transient protected lazy val isBarrier_ : Boolean = dependencies.exists(_.rdd.isBarrier())
+
+  /**
+   * Whether the RDD's computing function is idempotent. Idempotent here means the computing
+   * function not only satisfies the usual requirements, but also produces the same output
+   * sequence (the output order cannot vary) given the same input sequence. Spark assumes all
+   * RDDs are idempotent, except for shuffle RDDs and RDDs derived from a non-idempotent RDD.
+   */
+  // TODO: Add public APIs to allow users to mark their RDDs as non-idempotent.
+  // TODO: This could be per-partition, e.g. a UnionRDD can have some of its partitions idempotent.
+  private[spark] def isIdempotent: Boolean = {
+    dependencies.forall { dep =>
+      // A shuffle RDD is always considered non-idempotent, because its computing function needs
+      // to fetch remote shuffle blocks, and those fetched blocks may arrive in a random order.
+      !dep.isInstanceOf[ShuffleDependency[_, _, _]] && dep.rdd.isIdempotent
--- End diff --

This is too strict. As I discussed with @jiangxb1987, something like this would be better:

```
dependencies.forall { dep =>
  dep match {
    case shuffleDep: ShuffleDependency[_, _, _] =>
      shuffleDep.keyOrdering.isDefined
    // IIRC this is not comprehensive if checkpointing is happening as part of this job.
    case checkpointedDep: Dependency[_] if checkpointedDep.rdd.isCheckpointed =>
      true
    case _ =>
      dep.rdd.isIdempotent
  }
}
```

Note that this method can end up with stack overflow errors; please refer to `DAGScheduler.stageDependsOn`, which does a similar dependency traversal (but for a different purpose).
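To illustrate the stack-overflow concern, here is a minimal, hypothetical sketch of how the recursive dependency walk could be made iterative with an explicit stack and a visited set, following the visit-stack pattern `DAGScheduler.stageDependsOn` uses. The `Node`, `NarrowDep`, and `ShuffleDep` types below are simplified stand-ins invented for this example, not Spark's real `RDD`/`Dependency` classes:

```scala
import scala.collection.mutable

// Hypothetical, simplified model of an RDD lineage, used only to
// demonstrate the traversal; Spark's real classes carry far more state.
sealed trait Dep { def rdd: Node }
final case class NarrowDep(rdd: Node) extends Dep
final case class ShuffleDep(rdd: Node, keyOrderingDefined: Boolean) extends Dep
final case class Node(deps: Seq[Dep], checkpointed: Boolean = false)

object IdempotencyCheck {
  // Iterative depth-first walk: an explicit stack replaces recursion, so a
  // very long lineage chain cannot overflow the JVM call stack, and the
  // visited set keeps shared ancestors from being walked twice.
  def isIdempotent(root: Node): Boolean = {
    val visited = mutable.HashSet[Node]()
    val stack = mutable.Stack[Node](root)
    while (stack.nonEmpty) {
      val node = stack.pop()
      if (!visited(node)) {
        visited += node
        node.deps.foreach {
          case s: ShuffleDep =>
            // A shuffle is only order-stable when a key ordering is defined.
            if (!s.keyOrderingDefined) return false
          case dep if dep.rdd.checkpointed =>
            // Checkpointed data is read back deterministically; stop here.
            ()
          case dep =>
            stack.push(dep.rdd)
        }
      }
    }
    true
  }
}
```

A long chain of `NarrowDep` nodes that would recurse thousands of frames deep in the recursive version is handled here with constant call-stack depth.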