duongcongtoai opened a new issue, #5738:
URL: https://github.com/apache/arrow-datafusion/issues/5738

   ### Describe the bug
   
   The HashJoinExec decides output_partition based on this function: 
https://github.com/apache/arrow-datafusion/blob/b7a33317c2abf265f4ab6b3fe636f87c4d01334c/datafusion/core/src/physical_plan/joins/utils.rs#L90
   
   If PartitionMode is set to Partitioned, join_type is RIGHT, output_partition 
will depend on output_partition of the right child, this may cause missing 
execution on left child partitions, if left child has more partitions than 
right child partition: 
https://github.com/apache/arrow-datafusion/blob/e87754cfe3afa4c358a8ca9c21c3c4acd020dfe5/datafusion/core/src/physical_plan/joins/hash_join.rs#L413
   
   ### To Reproduce
   
   [Code in this 
gist](https://gist.github.com/duongcongtoai/4d82074e20c0dfeca8c324bba8ad0e66)
   
   Create 2 ExecutionPlan input from csv with only 1 field "id" and create a 
HashJoinExec from these inputs. Because during the execution, some parition 
from the left input is not executed on, they are never probed with associated 
rows in the right input, so result in a false join:
   ```
   +----+----+
   | id | id |
   +----+----+
   |    | 2  |
   |    | 3  |
   |    | 6  |
   |    | 7  |
   |    | 9  |
   |    | 1  |
   |    | 4  |
   |    | 5  |
   |    | 8  |
   +----+----+
   ```
   
   
   
   ### Expected behavior
   
   HashJoin executes correctly
   ```
   +----+----+
   | id | id |
   +----+----+
   | 1  | 1  |
   | 9  | 9  |
   | 5  | 5  |
   | 8  | 8  |
   | 6  | 6  |
   | 7  | 7  |
   | 4  | 4  |
   | 2  | 2  |
   | 3  | 3  |
   +----+----+
   ```
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to