[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

jiangxb1987 Thu, 23 Aug 2018 09:59:00 -0700

Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22112#discussion_r212376990
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1865,6 +1876,39 @@ abstract class RDD[T: ClassTag](
       // RDD chain.
       @transient protected lazy val isBarrier_ : Boolean =
         dependencies.filter(!_.isInstanceOf[ShuffleDependency[_, _, 
_]]).exists(_.rdd.isBarrier())
    +
    +  /**
    +   * Returns the random level of this RDD's computing function. Please 
refer to [[RDD.RandomLevel]]
    +   * for the definition of random level.
    +   *
    +   * By default, an RDD without parents(root RDD) is IDEMPOTENT. For RDDs 
with parents, the random
    +   * level of current RDD is the random level of the parent which is 
random most.
    +   */
    +  // TODO: make it public so users can set random level to their custom 
RDDs.
    +  // TODO: this can be per-partition. e.g. UnionRDD can have different 
random level for different
    +  // partitions.
    +  private[spark] def computingRandomLevel: RDD.RandomLevel.Value = {
    +    val parentRandomLevels = dependencies.map {
    +      case dep: ShuffleDependency[_, _, _] =>
    +        if (dep.rdd.computingRandomLevel == RDD.RandomLevel.INDETERMINATE) 
{
    +          RDD.RandomLevel.INDETERMINATE
    --- End diff --
    
    > > All other shuffle cases, we dont know the output order in spark.
    > 
    > Actually we know. As long as the shuffle map stage RDD is IDEMPOTENT or 
UNORDERED, the reduce RDD is UNORDERED instead of INDETERMINATE.
    
    IIUC shuffle map itself works as follows:
    
    - If Aggregator and key ordering are specified:
      - output becomes idempotent;
    - If Aggregator or key ordering are not specified:
      - If input is indeterminate, then output becomes indeterminate;
      - If input is idempotent or unordered, then output becomes unordered.
    
    We have to also include the case @mridulm raised that shuffle map may be 
skipped.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

Reply via email to