zanmato1984 commented on issue #37542:
URL: https://github.com/apache/arrow/issues/37542#issuecomment-1871692692

   This is an implementation specific behavior, i.e., depending on the 
underlying hash join algorithm. Currently there are two types of hash join 
implementation in acero, both having the similar logic. When doing left outer 
or full outer joins, the matched rows in the left side are first collected, by 
probing the hash table built from the right side. Then the unmatched rows in 
the left side are collected. This results in the order you observed. You can 
check the code below:
   
   For hash join (the normal algorithm), collecting the unmatched rows:
   
https://github.com/apache/arrow/blob/bcaeaa8c2d970b81249cfba019475598e3d3109f/cpp/src/arrow/acero/hash_join.cc#L440-L445
   
   For swiss join (a variation of hash join), first collecting the matched rows:
   
https://github.com/apache/arrow/blob/bcaeaa8c2d970b81249cfba019475598e3d3109f/cpp/src/arrow/acero/swiss_join.cc#L1964-L1998
   then collecting the unmatched rows:
   
https://github.com/apache/arrow/blob/bcaeaa8c2d970b81249cfba019475598e3d3109f/cpp/src/arrow/acero/swiss_join.cc#L2005-L2025
   
   > I know there's no explicit guaranty on the order of the output data on the 
join, but it's weird that it preserve order when values are present, and 
changes it when they are missing.
   
   Yes, you shouldn't make any assumption of ordering in SQL except explicitly 
specified by `order by` clause. Hope the above information explains your 
concern.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to