Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212332473
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1865,6 +1876,39 @@ abstract class RDD[T: ClassTag](
   // RDD chain.
   @transient protected lazy val isBarrier_ : Boolean =
     dependencies.filter(!_.isInstanceOf[ShuffleDependency[_, _, _]]).exists(_.rdd.isBarrier())
+
+  /**
+   * Returns the random level of this RDD's computing function. Please refer to
+   * [[RDD.RandomLevel]] for the definition of random level.
+   *
+   * By default, an RDD without parents (a root RDD) is IDEMPOTENT. For RDDs with
+   * parents, the random level of the current RDD is the random level of its most
+   * random parent.
+   */
+  // TODO: make it public so users can set the random level of their custom RDDs.
+  // TODO: this can be per-partition. e.g. UnionRDD can have a different random
+  // level for different partitions.
+  private[spark] def computingRandomLevel: RDD.RandomLevel.Value = {
+    val parentRandomLevels = dependencies.map {
+      case dep: ShuffleDependency[_, _, _] =>
+        if (dep.rdd.computingRandomLevel == RDD.RandomLevel.INDETERMINATE) {
+          RDD.RandomLevel.INDETERMINATE
--- End diff ---
> If checkpointed already - then always same order
`RDD#dependencies` is defined as:
```scala
final def dependencies: Seq[Dependency[_]] = {
  checkpointRDD.map(r => List(new OneToOneDependency(r))).getOrElse {
    if (dependencies_ == null) {
      dependencies_ = getDependencies
    }
    dependencies_
  }
}
```
So we don't need to handle checkpointing here: once an RDD is checkpointed, `dependencies` short-circuits to a single `OneToOneDependency` on the checkpoint RDD, and the original shuffle lineage is never visited.
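As a hypothetical miniature of that short-circuit (the names `ModelRDD`, `Dep`, and `checkpoint` are illustrative, not Spark's actual types), this shows how a checkpoint collapses the whole lineage into one one-to-one dependency:

```scala
// Illustrative sketch only: models how RDD#dependencies ignores the original
// lineage once a checkpoint exists. Not the real Spark implementation.
object CheckpointDepsSketch {
  case class Dep(desc: String)

  class ModelRDD(lineageDeps: Seq[Dep]) {
    // Set once the RDD has been materialized to a checkpoint.
    var checkpoint: Option[String] = None

    // Mirrors the structure of RDD#dependencies quoted above: if a checkpoint
    // exists, return a single one-to-one dependency on it; otherwise fall back
    // to the computed lineage dependencies.
    def dependencies: Seq[Dep] =
      checkpoint
        .map(cp => List(Dep(s"OneToOneDependency($cp)")))
        .getOrElse(lineageDeps)
  }

  def main(args: Array[String]): Unit = {
    val rdd = new ModelRDD(Seq(Dep("ShuffleDependency(parent)")))
    assert(rdd.dependencies == Seq(Dep("ShuffleDependency(parent)")))

    rdd.checkpoint = Some("ckpt-0") // after checkpointing...
    // ...the shuffle lineage is no longer reachable through dependencies.
    assert(rdd.dependencies == Seq(Dep("OneToOneDependency(ckpt-0)")))
    println("lineage collapsed to checkpoint dependency")
  }
}
```

Because the random-level computation walks `dependencies`, a checkpointed RDD is automatically seen through its checkpoint, which is why no explicit checkpoint handling is needed in `computingRandomLevel`.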
> All other shuffle cases, we dont know the output order in spark.
Actually, we do know something. As long as the shuffle map stage RDD is IDEMPOTENT or UNORDERED, the reduce-side RDD is UNORDERED instead of INDETERMINATE: on a retry, each reducer receives the same set of records, just in a possibly different order.
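That propagation rule can be sketched as a standalone model. This is a hypothetical illustration (the names `RandomLevelModel`, `NarrowDep`, and `ShuffleDep` are invented for the sketch, and it is not the actual Spark code): a root RDD defaults to IDEMPOTENT, a shuffle from an IDEMPOTENT or UNORDERED parent is capped at UNORDERED, a shuffle from an INDETERMINATE parent stays INDETERMINATE, and otherwise an RDD is as random as its most random parent.

```scala
// Illustrative model of the random-level propagation discussed above.
// Names mirror the PR but this is not the real Spark implementation.
object RandomLevelModel {
  object RandomLevel extends Enumeration {
    // Ordered from least to most random, so `max` picks the "most random" level.
    val IDEMPOTENT, UNORDERED, INDETERMINATE = Value
  }
  import RandomLevel._

  sealed trait Dep { def parentLevel: RandomLevel.Value }
  case class NarrowDep(parentLevel: RandomLevel.Value) extends Dep
  case class ShuffleDep(parentLevel: RandomLevel.Value) extends Dep

  def computingRandomLevel(deps: Seq[Dep]): RandomLevel.Value = {
    val parentLevels = deps.map {
      // Shuffling an indeterminate map output stays indeterminate.
      case ShuffleDep(INDETERMINATE) => INDETERMINATE
      // The map output is reproducible as a set, so only the order varies.
      case ShuffleDep(_)             => UNORDERED
      // Narrow dependencies pass the parent's level through unchanged.
      case NarrowDep(level)          => level
    }
    // A root RDD (no parents) is IDEMPOTENT by default.
    if (parentLevels.isEmpty) IDEMPOTENT else parentLevels.max
  }

  def main(args: Array[String]): Unit = {
    assert(computingRandomLevel(Nil) == IDEMPOTENT)
    assert(computingRandomLevel(Seq(ShuffleDep(IDEMPOTENT))) == UNORDERED)
    assert(computingRandomLevel(Seq(ShuffleDep(UNORDERED))) == UNORDERED)
    assert(computingRandomLevel(Seq(ShuffleDep(INDETERMINATE))) == INDETERMINATE)
    println("ok")
  }
}
```

The key case is the second one: a shuffle of an IDEMPOTENT or UNORDERED map stage produces UNORDERED, not INDETERMINATE, which is exactly the distinction made above.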
---