wecharyu commented on PR #12264:
URL: https://github.com/apache/gluten/pull/12264#issuecomment-4668118238

   @JkSelf Thanks for the prompt response. `HashedRelationBroadcastMode` does 
not contain `joinType`, which can also change the data in the built hash table. 
For example, in this PR's test the `LEFT SEMI JOIN` and `INNER JOIN` would use 
the same hash table, which caused the test to fail before this PR. 
https://github.com/apache/spark/blob/62ae4db28f3be8a0ca2c3016d27ca5a62f02915d/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L1152-L1153
   
   And since Gluten currently still broadcasts the raw build plan data bytes 
instead of the hash table, I agree with @liujiayi771 that `BroadcastExchangeExec` 
can be reused for the same build plan even when the build keys differ. This 
reuse avoids broadcasting the same data multiple times.
   
   I think we can now make the following improvements:
   1. Move `doCanonicalizeForBroadcastMode()` back to reuse the broadcast 
exchange as much as possible.
   2. Associate the constructed hash table with the `plan_id`, `build_keys`, 
`drop_duplicates`, and `null_aware` flags, as these attributes together uniquely 
identify the hash table.
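
   A minimal sketch of point 2, written in Python only for brevity and not 
taken from Gluten's code. `HashTableKey`, `HashTableCache`, and `build_table` 
are hypothetical names, and it assumes a `LEFT SEMI JOIN` builds with 
`drop_duplicates=True` while an `INNER JOIN` does not:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class HashTableKey:
    """Hypothetical cache key made of the four attributes listed above."""
    plan_id: int
    build_keys: Tuple[str, ...]
    drop_duplicates: bool
    null_aware: bool


class HashTableCache:
    """Hypothetical cache: one built hash table per distinct key."""

    def __init__(self) -> None:
        self._tables: Dict[HashTableKey, dict] = {}
        self.builds = 0  # counts how many tables were actually built

    def get_or_build(self, key: HashTableKey, build: Callable[[], dict]) -> dict:
        if key not in self._tables:
            self._tables[key] = build()
            self.builds += 1
        return self._tables[key]


# Stand-in for the broadcast raw build-side rows (shared by both joins).
rows = [("a", 1), ("a", 2), ("b", 3)]


def build_table(drop_duplicates: bool) -> dict:
    """Toy build step: keep only the first row per key when deduplicating."""
    table: dict = {}
    for k, v in rows:
        if drop_duplicates:
            table.setdefault(k, v)
        else:
            table.setdefault(k, []).append(v)
    return table


cache = HashTableCache()
# Same plan and build keys; only drop_duplicates differs (assumed per join type).
inner_key = HashTableKey(1, ("k",), drop_duplicates=False, null_aware=False)
semi_key = HashTableKey(1, ("k",), drop_duplicates=True, null_aware=False)

inner_table = cache.get_or_build(inner_key, lambda: build_table(False))
semi_table = cache.get_or_build(semi_key, lambda: build_table(True))
inner_again = cache.get_or_build(inner_key, lambda: build_table(False))
```

   Here the raw rows are shared, which corresponds to reusing the broadcast 
exchange, while the two joins get separate hash tables because their keys 
differ in `drop_duplicates`. A repeated lookup with the same key returns the 
table that was already built.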


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

