[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

mridulm Thu, 23 Aug 2018 10:06:01 -0700

Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22112#discussion_r212385688
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1865,6 +1876,39 @@ abstract class RDD[T: ClassTag](
       // RDD chain.
       @transient protected lazy val isBarrier_ : Boolean =
         dependencies.filter(!_.isInstanceOf[ShuffleDependency[_, _, _]]).exists(_.rdd.isBarrier())
    +
    +  /**
    +   * Returns the random level of this RDD's computing function. Please refer to [[RDD.RandomLevel]]
    +   * for the definition of random level.
    +   *
    +   * By default, an RDD without parents(root RDD) is IDEMPOTENT. For RDDs with parents, the random
    +   * level of current RDD is the random level of the parent which is random most.
    +   */
    +  // TODO: make it public so users can set random level to their custom RDDs.
    +  // TODO: this can be per-partition. e.g. UnionRDD can have different random level for different
    +  // partitions.
    +  private[spark] def computingRandomLevel: RDD.RandomLevel.Value = {
    +    val parentRandomLevels = dependencies.map {
    +      case dep: ShuffleDependency[_, _, _] =>
    +        if (dep.rdd.computingRandomLevel == RDD.RandomLevel.INDETERMINATE) {
    +          RDD.RandomLevel.INDETERMINATE
    --- End diff --
    
    RE: checkpoint.
    
    I wanted to handle two cases.
    * The checkpoint is being done as part of the current job (and not by a previous job which forced materialization of the checkpointed RDD).
    * The checkpoint is written to a reliable store rather than a local one, since a local checkpoint can be lost on node failure.
    
    It looks like `dep.rdd.isCheckpointed` is the wrong way to go about this: relying on `dependencies` is insufficient for both cases.
    
    A better option seems to be:
    ```
      // If the RDD is already checkpointed, its output order is always the same
      case dep: Dependency[_] if dep.rdd.getCheckpointFile.isDefined =>
        RDD.RandomLevel.IDEMPOTENT
    ```
    
    > Actually we know. As long as the shuffle map stage RDD is IDEMPOTENT or UNORDERED, the reduce RDD is UNORDERED instead of INDETERMINATE.
    
    It does not matter what the output order of the map stage was: once we shuffle the map output, the order is always indeterminate, except for the specific cases I referred to above.
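
    To make the checkpoint rule above concrete, here is a minimal, Spark-free sketch of how the level could propagate through a lineage. Everything in it is a hypothetical stand-in: `Node`, `NarrowDep`, `ShuffleDep`, `rootLevel` and `level` are not Spark classes or methods, and `checkpointFile` merely models `getCheckpointFile`. The non-INDETERMINATE shuffle branch is cut off in the diff, so the sketch assumes UNORDERED there, following the quoted reply; treat that branch as an assumption, not as what the PR does.

    ```scala
    object RandomLevelSketch {
      // The three levels named in this thread; a larger id means "more random".
      object RandomLevel extends Enumeration {
        val IDEMPOTENT, UNORDERED, INDETERMINATE = Value
      }

      // Hypothetical stand-ins for Dependency and RDD (not Spark classes).
      sealed trait Dep { def parent: Node }
      final case class NarrowDep(parent: Node) extends Dep
      final case class ShuffleDep(parent: Node) extends Dep

      final case class Node(
          deps: Seq[Dep] = Nil,
          // Models getCheckpointFile: defined only for an existing reliable checkpoint.
          checkpointFile: Option[String] = None,
          // Level of a root node; only consulted when deps is empty.
          rootLevel: RandomLevel.Value = RandomLevel.IDEMPOTENT)

      def level(node: Node): RandomLevel.Value =
        if (node.deps.isEmpty) {
          node.rootLevel
        } else {
          node.deps.map {
            // Proposed rule: an already checkpointed parent always replays the same order.
            case dep if dep.parent.checkpointFile.isDefined => RandomLevel.IDEMPOTENT
            case ShuffleDep(p) =>
              if (level(p) == RandomLevel.INDETERMINATE) RandomLevel.INDETERMINATE
              else RandomLevel.UNORDERED // assumption: this branch is not visible in the diff
            case NarrowDep(p) => level(p)
          }.maxBy(_.id) // the "most random" parent wins
        }
    }
    ```

    In this toy model, a shuffle over an INDETERMINATE parent stays INDETERMINATE, while the same shuffle over that parent with a checkpoint file set comes out IDEMPOTENT, because the checkpoint case is matched before the shuffle case.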



