duongcongtoai opened a new issue, #5738: URL: https://github.com/apache/arrow-datafusion/issues/5738
### Describe the bug The HashJoinExec decides output_partition based on this function: https://github.com/apache/arrow-datafusion/blob/b7a33317c2abf265f4ab6b3fe636f87c4d01334c/datafusion/core/src/physical_plan/joins/utils.rs#L90 If PartitionMode is set to Partitioned, join_type is RIGHT, output_partition will depend on output_partition of the right child, this may cause missing execution on left child partitions, if left child has more partitions than right child partition: https://github.com/apache/arrow-datafusion/blob/e87754cfe3afa4c358a8ca9c21c3c4acd020dfe5/datafusion/core/src/physical_plan/joins/hash_join.rs#L413 ### To Reproduce [Code in this gist](https://gist.github.com/duongcongtoai/4d82074e20c0dfeca8c324bba8ad0e66) Create 2 ExecutionPlan input from csv with only 1 field "id" and create a HashJoinExec from these inputs. Because during the execution, some parition from the left input is not executed on, they are never probed with associated rows in the right input, so result in a false join: ``` +----+----+ | id | id | +----+----+ | | 2 | | | 3 | | | 6 | | | 7 | | | 9 | | | 1 | | | 4 | | | 5 | | | 8 | +----+----+ ``` ### Expected behavior HashJoin executes correctly ``` +----+----+ | id | id | +----+----+ | 1 | 1 | | 9 | 9 | | 5 | 5 | | 8 | 8 | | 6 | 6 | | 7 | 7 | | 4 | 4 | | 2 | 2 | | 3 | 3 | +----+----+ ``` ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
