[GitHub] spark pull request: [SPARK-9702][SQL] Use Exchange to perform shuf...

JoshRosen Mon, 10 Aug 2015 10:51:22 -0700

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8030#discussion_r36661594
  
    --- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
    @@ -312,7 +312,11 @@ private[sql] abstract class SparkStrategies extends 
QueryPlanner[SparkPlan] {
             throw new IllegalStateException(
               "logical distinct operator should have been replaced by 
aggregate in the optimizer")
           case logical.Repartition(numPartitions, shuffle, child) =>
    -        execution.Repartition(numPartitions, shuffle, planLater(child)) :: 
Nil
    +        if (shuffle) {
    +          execution.Exchange(HashPartitioning(child.output, 
numPartitions), planLater(child)) :: Nil
    --- End diff --
    
    One slight efficiency concern: some Row implementations, such as 
`UnsafeRow`, have specialized `Row.hashCode` implementations which can be much 
faster than using a projection and hashing expressions to compute a hash code.
    
    It would be nice to take advantage of this. I can think of two possible 
approaches:
    
    1. Declare a new `Partitioning` which is designed for this case, then add 
support for this in `Exchange`. This is slightly complicated because we also 
need to implement `compatibleWith` and `guarantees` correctly, but it's nice in 
the sense that it's purely additive and doesn't risk introducing mistakes / 
incompatibilities with the existing `HashPartitioning`. Given that we should 
probably drop Repartition operators that are children of `Exchange` (as in 
#7959), this should be fine.
    2. Automatically perform this optimization whenever `HashPartitioning` is 
called with `child.output`. This would avoid having to add a new 
`HashPartitioner` that we have to explain / document, but carries its own 
complexity: what happens if I have two `HashPartitionings`, one configured with 
`child.output` and another configured with some permutation of `child.output`? 
If one partitioning is able to take advantage of the optimization and another 
isn't, then we risk computing different hashcodes using partitionings that 
should be logically equivalent.
    
    /cc @yhuai, any thoughts on this?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-9702][SQL] Use Exchange to perform shuf...

Reply via email to