wecharyu commented on PR #12264: URL: https://github.com/apache/gluten/pull/12264#issuecomment-4668118238
@JkSelf Thanks for the prompt response.

`HashedRelationBroadcastMode` does not contain `joinType`, which can also change the data in the built hash table. For example, in this PR's test the `LEFT SEMI JOIN` and the `INNER JOIN` would use the same hash table, which caused the test to fail before this PR.

https://github.com/apache/spark/blob/62ae4db28f3be8a0ca2c3016d27ca5a62f02915d/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L1152-L1153

Since Gluten currently still broadcasts the raw bytes of the build plan's data rather than the hash table, I agree with @liujiayi771 that `BroadcastExchangeExec` can be reused for the same build plan even when the build keys differ. This reuse avoids broadcasting the same data multiple times.

I think we can now make the following improvements:

1. Move `doCanonicalizeForBroadcastMode()` back to reuse the broadcast exchange as much as possible.
2. Associate the constructed hash table with the `plan_id`, `build_keys`, `drop_duplicates`, and `null_aware` flag, as these attributes uniquely identify the hash table.
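
To make the second point concrete, here is a minimal Java sketch of a hash table cache keyed on those four attributes. Every name in it (`HashTableKey`, `HashTable`, `HashTableCache`) is hypothetical and is not Gluten's actual API. It also assumes, purely for illustration, that the semi join builds with duplicates dropped while the inner join keeps them.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class HashTableCacheSketch {

    // Hypothetical cache key: the four attributes proposed above.
    // joinType is deliberately absent; its effect on the table is assumed
    // to be captured by dropDuplicates and nullAware.
    record HashTableKey(long planId, List<String> buildKeys,
                        boolean dropDuplicates, boolean nullAware) {
        HashTableKey {
            // Immutable copy so equals/hashCode cannot change after insertion.
            buildKeys = List.copyOf(buildKeys);
        }
    }

    // Stand-in for a handle to a built (native) hash table.
    record HashTable(HashTableKey builtFor) {}

    static final class HashTableCache {
        private final Map<HashTableKey, HashTable> tables = new ConcurrentHashMap<>();

        // Builds the table at most once per distinct key.
        HashTable getOrBuild(HashTableKey key, Function<HashTableKey, HashTable> builder) {
            return tables.computeIfAbsent(key, builder);
        }

        int size() {
            return tables.size();
        }
    }

    public static void main(String[] args) {
        HashTableCache cache = new HashTableCache();

        // Same broadcast build plan (planId 1) and same build key, but the
        // two joins need different tables (assumption: semi join dedups).
        HashTableKey inner = new HashTableKey(1L, List.of("k"), false, false);
        HashTableKey semi = new HashTableKey(1L, List.of("k"), true, false);

        HashTable innerTable = cache.getOrBuild(inner, HashTable::new);
        HashTable semiTable = cache.getOrBuild(semi, HashTable::new);
        HashTable innerAgain = cache.getOrBuild(inner, HashTable::new);

        System.out.println(innerTable == semiTable);  // false: distinct tables
        System.out.println(innerTable == innerAgain); // true: reused
        System.out.println(cache.size());             // 2
    }
}
```

With a key like this, the broadcast payload can still be shared per build plan, while each join gets a hash table that matches how it needs the build side to be constructed.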
