Cheng Su created SPARK-32649:
--------------------------------
Summary: Optimize BHJ/SHJ inner and semi join with empty hashed
relation
Key: SPARK-32649
URL: https://issues.apache.org/jira/browse/SPARK-32649
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.1.0
Reporter: Cheng Su
With `EmptyHashedRelation` introduced in
[https://github.com/apache/spark/pull/29389], there is a minor optimization
we can apply to broadcast hash join (BHJ) and shuffled hash join (SHJ) when
the build-side hashed relation is empty.
If the build-side hashed relation is empty (i.e. the build side is empty):
1. Inner join: we don't need to execute the stream side at all; just return an
empty iterator -
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L152]
2. Semi join: we don't need to execute the stream side at all; just return an
empty iterator -
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L227]
An empty build side is not common, but when it occurs we can skip executing the
stream side entirely, which saves query CPU and IO.
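
A minimal sketch of the idea in plain Scala, not Spark's actual code: the
object, the `Row` alias, the `Map`-based build side, and the `streamRowsRead`
counter are all hypothetical stand-ins for illustration. The stream side is
passed by name, so when the build side is empty it is never even constructed.

```scala
// Hypothetical, simplified model of a hash join; none of these names are Spark APIs.
object EmptyBuildSideSketch {
  type Row = (Int, String) // (join key, payload)

  // Counts stream-side rows that were actually pulled, to make the saving visible.
  var streamRowsRead = 0

  // Stand-in for executing the stream-side plan.
  def streamSide(rows: Seq[Row]): Iterator[Row] =
    rows.iterator.map { r => streamRowsRead += 1; r }

  // Inner join: emit (streamRow, buildRow) for every key match.
  // `stream` is by-name, so it is only evaluated on the non-empty path.
  def innerJoin(stream: => Iterator[Row],
                build: Map[Int, Seq[Row]]): Iterator[(Row, Row)] =
    if (build.isEmpty) Iterator.empty // short circuit: stream side never runs
    else stream.flatMap { s =>
      build.getOrElse(s._1, Seq.empty).iterator.map(b => (s, b))
    }

  // Semi join: emit each stream row that has at least one build-side match.
  def semiJoin(stream: => Iterator[Row],
               build: Map[Int, Seq[Row]]): Iterator[Row] =
    if (build.isEmpty) Iterator.empty // same short circuit
    else stream.filter(s => build.contains(s._1))
}
```

With an empty `build` map, consuming the result of either join leaves
`streamRowsRead` at zero, which is the CPU/IO saving described above. Other
join types are deliberately left out of the sketch, since the issue covers only
inner and semi joins.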