c21 commented on a change in pull request #29130:
URL: https://github.com/apache/spark/pull/29130#discussion_r456502678
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##########
@@ -47,6 +47,18 @@ case class ShuffledHashJoinExec(
"buildDataSize" -> SQLMetrics.createSizeMetric(sparkContext, "data size of
build side"),
"buildTime" -> SQLMetrics.createTimingMetric(sparkContext, "time to build
hash map"))
+ override def outputPartitioning: Partitioning = joinType match {
Review comment:
> Why does shuffle hash join not support FullOuter?
@cloud-fan sorry if I missed anything, but [isn't this true
now](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L196-L210)?
In Spark's current hash join implementation, the stream side probes the build
side hash map, so it can handle non-matching keys from the stream side (a
stream row with no match in the build side hash map is simply emitted with
nulls). But it cannot handle non-matching keys from the build side, because no
match information is persisted while the stream side is consumed.
I feel an interesting follow-up could be to support full outer join in
shuffled hash join: when looking up stream side keys in the build side
`HashedRelation`, mark the matched rows inside the `HashedRelation`. Then,
after reading all rows from the stream side, output all the non-matching rows
from the build side based on the modified `HashedRelation`.
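To make the idea concrete, here is a minimal sketch of the two-phase approach using plain Scala collections. This is not Spark's actual `HashedRelation` API; `fullOuterJoin`, the `matched` set, and the `(Option, Option)` output shape are all hypothetical stand-ins for illustration (a real implementation would need per-row match bits, not per-key):

```scala
import scala.collection.mutable

object FullOuterHashJoinSketch {
  // Hypothetical full outer join on key K: phase 1 probes the build-side map
  // with stream rows and marks matched keys; phase 2 emits unmatched build rows.
  def fullOuterJoin[K, V](
      stream: Seq[(K, V)],
      build: Seq[(K, V)]): Seq[(Option[V], Option[V])] = {
    // Stand-in for the build side HashedRelation: key -> build rows.
    val relation = mutable.LinkedHashMap.empty[K, mutable.ArrayBuffer[V]]
    build.foreach { case (k, v) =>
      relation.getOrElseUpdate(k, mutable.ArrayBuffer.empty) += v
    }
    // The extra state the proposal adds: which build entries were matched.
    val matched = mutable.Set.empty[K]
    val out = mutable.ArrayBuffer.empty[(Option[V], Option[V])]
    // Phase 1: stream side looks up the build side; stream rows with no match
    // are emitted with a null (None) build side, as hash join already does.
    stream.foreach { case (k, sv) =>
      relation.get(k) match {
        case Some(rows) =>
          matched += k
          rows.foreach(bv => out += ((Some(sv), Some(bv))))
        case None =>
          out += ((Some(sv), None))
      }
    }
    // Phase 2: after the stream side is exhausted, output the non-matching
    // build rows recorded in the modified relation.
    relation.foreach { case (k, rows) =>
      if (!matched(k)) rows.foreach(bv => out += ((None, Some(bv))))
    }
    out.toSeq
  }
}
```

For example, joining stream `Seq(1 -> "a", 2 -> "b")` with build `Seq(2 -> "x", 3 -> "y")` yields the matched pair for key 2 plus one row with a null build side (key 1) and one with a null stream side (key 3).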
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]