Re: [I] Left join doesn't preserve order of left for right missing values [arrow]

via GitHub Thu, 28 Dec 2023 18:59:38 -0800


zanmato1984 commented on issue #37542:
URL: https://github.com/apache/arrow/issues/37542#issuecomment-1871692692

This is an implementation specific behavior, i.e., depending on the
underlying hash join algorithm. Currently there are two types of hash join
implementation in acero, both having the similar logic. When doing left outer
or full outer joins, the matched rows in the left side are first collected, by
probing the hash table built from the right side. Then the unmatched rows in
the left side are collected. This results in the order you observed. You can
check the code below:

For hash join (the normal algorithm), collecting the unmatched rows:

https://github.com/apache/arrow/blob/bcaeaa8c2d970b81249cfba019475598e3d3109f/cpp/src/arrow/acero/hash_join.cc#L440-L445

For swiss join (a variation of hash join), first collecting the matched rows:

https://github.com/apache/arrow/blob/bcaeaa8c2d970b81249cfba019475598e3d3109f/cpp/src/arrow/acero/swiss_join.cc#L1964-L1998
then collecting the unmatched rows:

https://github.com/apache/arrow/blob/bcaeaa8c2d970b81249cfba019475598e3d3109f/cpp/src/arrow/acero/swiss_join.cc#L2005-L2025

> I know there's no explicit guaranty on the order of the output data on the
join, but it's weird that it preserve order when values are present, and
changes it when they are missing.

Yes, you shouldn't make any assumption of ordering in SQL except explicitly
specified by `order by` clause. Hope the above information explains your
concern.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] Left join doesn't preserve order of left for right missing values [arrow]

Reply via email to