c21 commented on a change in pull request #29130:
URL: https://github.com/apache/spark/pull/29130#discussion_r456502678
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -47,6 +47,18 @@ case class ShuffledHashJoinExec(
"buildDataSize" -> SQLMetrics.createSizeMetric(sparkContext, "data size of
build side"),
"buildTime" -> SQLMetrics.createTimingMetric(sparkContext, "time to build
hash map"))
+ override def outputPartitioning: Partitioning = joinType match {
Review comment:
> Why does shuffle hash join not support FullOuter?
@cloud-fan sorry if I missed anything, but [isn't this true
now](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L196-L210)?
In Spark's current hash join implementation, the stream side probes the build
side hash map, so it can handle non-matching keys from the stream side (a
stream row with no match in the build side hash map is simply emitted with
nulls). But it cannot handle non-matching keys from the build side, because no
match information is persisted while the stream side is consumed.
I feel an interesting follow-up could be to support full outer join in
shuffled hash join: when looking up stream side keys in the build side
`HashedRelation`, mark the matched rows inside the `HashedRelation`. Then,
after reading all rows from the stream side, output all the non-matching rows
from the build side based on the modified `HashedRelation`.
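To make the idea concrete, here is a minimal sketch of the two-phase approach using plain Scala collections. This is not Spark's actual `HashedRelation` API; `fullOuterJoin`, the `matched` set, and the `(Option, Option)` output shape are all hypothetical stand-ins for illustration (a real implementation would need per-row match bits, not per-key):

```scala
import scala.collection.mutable

object FullOuterHashJoinSketch {
  // Hypothetical full outer join on key K: phase 1 probes the build-side map
  // with stream rows and marks matched keys; phase 2 emits unmatched build rows.
  def fullOuterJoin[K, V](
      stream: Seq[(K, V)],
      build: Seq[(K, V)]): Seq[(Option[V], Option[V])] = {
    // Stand-in for the build side HashedRelation: key -> build rows.
    val relation = mutable.LinkedHashMap.empty[K, mutable.ArrayBuffer[V]]
    build.foreach { case (k, v) =>
      relation.getOrElseUpdate(k, mutable.ArrayBuffer.empty) += v
    }
    // The extra state the proposal adds: which build entries were matched.
    val matched = mutable.Set.empty[K]
    val out = mutable.ArrayBuffer.empty[(Option[V], Option[V])]
    // Phase 1: stream side looks up the build side; stream rows with no match
    // are emitted with a null (None) build side, as hash join already does.
    stream.foreach { case (k, sv) =>
      relation.get(k) match {
        case Some(rows) =>
          matched += k
          rows.foreach(bv => out += ((Some(sv), Some(bv))))
        case None =>
          out += ((Some(sv), None))
      }
    }
    // Phase 2: after the stream side is exhausted, output the non-matching
    // build rows recorded in the modified relation.
    relation.foreach { case (k, rows) =>
      if (!matched(k)) rows.foreach(bv => out += ((None, Some(bv))))
    }
    out.toSeq
  }
}
```

For example, joining stream `Seq(1 -> "a", 2 -> "b")` with build `Seq(2 -> "x", 3 -> "y")` yields the matched pair for key 2 plus one row with a null build side (key 1) and one with a null stream side (key 3).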
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]