Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/8030#discussion_r36661594
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
@@ -312,7 +312,11 @@ private[sql] abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
         throw new IllegalStateException(
           "logical distinct operator should have been replaced by aggregate in the optimizer")
       case logical.Repartition(numPartitions, shuffle, child) =>
-        execution.Repartition(numPartitions, shuffle, planLater(child)) :: Nil
+        if (shuffle) {
+          execution.Exchange(HashPartitioning(child.output, numPartitions), planLater(child)) :: Nil
--- End diff --
One slight efficiency concern: some Row implementations, such as
`UnsafeRow`, have specialized `Row.hashCode` implementations which can be much
faster than using a projection and hashing expressions to compute a hash code.
It would be nice to take advantage of this. I can think of two possible
approaches:
1. Declare a new `Partitioning` which is designed for this case, then add
support for this in `Exchange`. This is slightly complicated because we also
need to implement `compatibleWith` and `guarantees` correctly, but it's nice in
the sense that it's purely additive and doesn't risk introducing mistakes /
incompatibilities with the existing `HashPartitioning`. Given that we should
probably drop Repartition operators that are children of `Exchange` (as in
#7959), this should be fine.
2. Automatically perform this optimization whenever `HashPartitioning` is
called with `child.output`. This would avoid having to add a new
`Partitioning` that we have to explain / document, but carries its own
complexity: what happens if I have two `HashPartitioning`s, one configured with
`child.output` and another configured with some permutation of `child.output`?
If one partitioning is able to take advantage of the optimization and the other
isn't, then we risk computing different hash codes using partitionings that
should be logically equivalent.
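
To make the risk in option 2 concrete, here is a minimal, Spark-free sketch. Everything in it is hypothetical and only for illustration: `SimpleRow`, `rowHash`, `projectedHash`, and `partitionId` are stand-ins I made up, not Spark's `Row`, `UnsafeRow`, `HashPartitioning`, or `Exchange`, and the hash functions are toy formulas rather than the ones Spark uses. It assumes the fast path hashes the whole row directly while the generic path projects columns and hashes them one by one.

```scala
object HashPathSketch {
  // Hypothetical stand-in for a row of integer columns (not Spark's Row).
  type SimpleRow = Vector[Int]

  // Stand-in for a specialized whole-row hash, i.e. the "fast path".
  def rowHash(row: SimpleRow): Int =
    row.foldLeft(7)((h, v) => 31 * h + v)

  // Stand-in for the generic path: project the requested columns,
  // then hash them in the order the partitioning lists them.
  def projectedHash(row: SimpleRow, ordinals: Seq[Int]): Int =
    ordinals.foldLeft(0)((h, i) => 37 * h + row(i))

  // Non-negative modulus, so negative hash codes still map to a valid partition.
  def partitionId(hash: Int, numPartitions: Int): Int =
    ((hash % numPartitions) + numPartitions) % numPartitions

  def main(args: Array[String]): Unit = {
    val row: SimpleRow = Vector(1, 2)
    val numPartitions = 4

    // Partitioning configured with the columns in output order: takes the fast path.
    val viaRowHash = partitionId(rowHash(row), numPartitions)
    // Partitioning configured with a permutation of the same columns: generic path.
    val viaProjection = partitionId(projectedHash(row, Seq(1, 0)), numPartitions)

    println(s"whole-row hash path -> partition $viaRowHash")    // prints 0
    println(s"projection path     -> partition $viaProjection") // prints 3
  }
}
```

Both calls cover the same set of columns, yet the same row lands in partition 0 on one path and partition 3 on the other. Under the assumptions above, any two partitionings that are treated as equivalent would need to agree on which hash path they use.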
/cc @yhuai, any thoughts on this?