bart-samwel commented on pull request #29181: URL: https://github.com/apache/spark/pull/29181#issuecomment-663059685
> @bart-samwel - just to bring us in the same page. > > Current spark scala/java implementation for hash join (broadcast hash join and shuffled hash join) has following restriction: > > * [For left outer join, stream side can only be left side](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L288). > * [Similarly, for right outer join, stream side can only be right side](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L281). > * Full outer join is not supported in broadcast hash join and shuffled hash join (have to do a sort merge join, code reference same as above). > > Both of cases you mentioned are to do right outer join, with left stream side. This will not happen. > > A separate topic: I think it would be interesting to explore support full outer join in shuffled hash join and broadcast hash join where I discussed with @cloud-fan in [another PR](https://github.com/apache/spark/pull/29130#discussion_r456502678). I created a JIRA for this now - https://issues.apache.org/jira/browse/SPARK-32399. This should help save shuffle and sort as currently for full outer join, we always do a sort merge join no matter of table size. BTW, does delta engine support full outer join in hash join? Would like to understand more here. Thanks. Doing full outer for SHJ is not that hard so we should have that. BHJ is harder because you have to merge the "probedness" of all tasks before figuring out which rows you need to emit. (Delta engine will indeed support full outer join in SHJ.) Let's be future proof for these cases! ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
