[GitHub] spark pull request: [SPARK-9703] [SQL] Refactor EnsureRequirements...

cloud-fan Sun, 09 Aug 2015 23:37:40 -0700

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7988#discussion_r36604936
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
    @@ -90,9 +121,66 @@ sealed trait Partitioning {
       /**
        * Returns true iff we can say that the partitioning scheme of this [[Partitioning]]
        * guarantees the same partitioning scheme described by `other`.
    +   *
    +   * Compatibility of partitionings is only checked for operators that have multiple children
    +   * and that require a specific child output [[Distribution]], such as joins.
    +   *
    +   * Intuitively, partitionings are compatible if they route the same partitioning key to the same
    +   * partition. For instance, two hash partitionings are only compatible if they produce the same
    +   * number of output partitionings and hash records according to the same hash function and
    +   * same partitioning key schema.
    +   *
    +   * Put another way, two partitionings are compatible with each other if they satisfy all of the
    +   * same distribution guarantees.
        */
    -  // TODO: Add an example once we have the `nullSafe` concept.
    -  def guarantees(other: Partitioning): Boolean
    +  def compatibleWith(other: Partitioning): Boolean
    +
    +  /**
    +   * Returns true iff we can say that the partitioning scheme of this [[Partitioning]] guarantees
    +   * the same partitioning scheme described by `other`. If a `A.guarantees(B)`, then repartitioning
    +   * the child's output according to `B` will be unnecessary. `guarantees` is used as a performance
    +   * optimization to allow the exchange planner to avoid redundant repartitionings. By default,
    +   * a partitioning only guarantees partitionings that are equal to itself (i.e. the same number
    +   * of partitions, same strategy (range or hash), etc).
    +   *
    +   * In order to enable more aggressive optimization, this strict equality check can be relaxed.
    +   * For example, say that the planner needs to repartition all of an operator's children so that
    +   * they satisfy the [[AllTuples]] distribution. One way to do this is to repartition all children
    +   * to have the [[SinglePartition]] partitioning. If one of the operator's children already happens
    +   * to be hash-partitioned with a single partition then we do not need to re-shuffle this child;
    +   * this repartitioning can be avoided if a single-partition [[HashPartitioning]] `guarantees`
    +   * [[SinglePartition]].
    +   *
    +   * The SinglePartition example given above is not particularly interesting; guarantees' real
    +   * value occurs for more advanced partitioning strategies. SPARK-7871 will introduce a notion
    +   * of null-safe partitionings, under which partitionings can specify whether rows whose
    +   * partitioning keys contain null values will be grouped into the same partition or whether they
    +   * will have an unknown / random distribution. If a partitioning does not require nulls to be
    +   * clustered then a partitioning which _does_ cluster nulls will guarantee the null clustered
    +   * partitioning. The converse is not true, however: a partitioning which clusters nulls cannot
    +   * be guaranteed by one which does not cluster them. Thus, in general `guarantees` is not a
    +   * symmetric relation.
    +   *
    +   * Another way to think about `guarantees`: if `A.guarantees(B)`, then any partitioning of rows
    +   * produced by `A` could have also been produced by `B`.
    +   */
    +  def guarantees(other: Partitioning): Boolean = this == other
    +}
    +
    +object Partitioning {
    +  def allCompatible(partitionings: Seq[Partitioning]): Boolean = {
    +    // Note: this assumes transitivity
    +    partitionings.sliding(2).map {
    --- End diff --
    
    Nit: we can use `forall` instead of `map` here and remove the `forall` at the end.
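
The nit can be illustrated with a minimal sketch. The quoted diff is truncated at `partitionings.sliding(2).map {`, so the closure body is not visible; the code below assumes each window of two neighbours is checked pairwise and the results are then combined with `forall`. `ToyPartitioning` and `AllCompatibleSketch` are hypothetical stand-ins for illustration, not Spark's actual classes.

```scala
object AllCompatibleSketch {
  // Hypothetical stand-in for Partitioning; compatibility here is simply
  // "same number of partitions", which is an assumption for illustration.
  final case class ToyPartitioning(numPartitions: Int) {
    def compatibleWith(other: ToyPartitioning): Boolean =
      numPartitions == other.numPartitions
  }

  // Assumed shape of the code under review: map every window to a Boolean,
  // then fold the results with a trailing forall.
  def allCompatibleWithMap(ps: Seq[ToyPartitioning]): Boolean =
    ps.sliding(2).map {
      case Seq(a, b) => a.compatibleWith(b)
      case _ => true // a single-element window has nothing to compare
    }.forall(identity)

  // Suggested shape: pass the predicate to forall directly.
  // Like the original, this relies on compatibility being transitive.
  def allCompatible(ps: Seq[ToyPartitioning]): Boolean =
    ps.sliding(2).forall {
      case Seq(a, b) => a.compatibleWith(b)
      case _ => true
    }
}
```

Both versions return the same result; the second just drops the intermediate `map` step and the `identity` predicate.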



