korowa commented on code in PR #8020:
URL: https://github.com/apache/arrow-datafusion/pull/8020#discussion_r1382087972
##########
datafusion/sqllogictest/test_files/join_disable_repartition_joins.slt:
##########
@@ -72,11 +72,11 @@ SELECT t1.a, t1.b, t1.c, t2.a as a2
 ON t1.d = t2.d ORDER BY a2, t2.b
 LIMIT 5
 ----
-0 0 0 0
-0 0 2 0
-0 0 3 0
-0 0 6 0
-0 0 20 0
+1 3 95 0

Review Comment:
Well, yes. This query result is ordered only by t2 columns, so the order of t1 rows is arbitrary. The current behaviour of the index-matching function is to [iterate](https://github.com/apache/arrow-datafusion/blob/c2e768052c43e4bab6705ee76befc19de383c2cb/datafusion/physical-plan/src/joins/hash_join.rs#L881) over the probe-side indices in reverse and attach build-side indices to them (the HashMap + Vector data structure also emits build-side indices in reverse order). After the whole probe side has been matched, the resulting arrays are [reversed](https://github.com/apache/arrow-datafusion/blob/c2e768052c43e4bab6705ee76befc19de383c2cb/datafusion/physical-plan/src/joins/hash_join.rs#L915-L916) again, which returns both the right- and left-side indices in their natural order.

With partial output, I can't see any option other than iterating over the probe side in natural order (otherwise the record order would be broken, as there is no "full batch" anymore to reverse), but at the same time the build side is still stored in the same data structure, in reverse order.

So it's a side effect: hash join still maintains the probe-side input order, but no longer the build-side order (I guess that could potentially be restored by tweaking the `collect_build_side` function). That's why the t1 order in this query result is inverted now.
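
To illustrate the ordering effect described in the comment, here is a minimal sketch of a chained build-side index. It is a simplified model, not DataFusion's actual code: `ChainedIndex`, `match_reversed` and `match_natural` are hypothetical names, and the layout (a `HashMap` holding the last inserted row per key plus a `next` vector linking to earlier rows) is an assumption based on the "HashMap + Vector" description.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified build-side index (not DataFusion's real type).
/// `head[key]` stores (last inserted row + 1); `next[row]` stores
/// (previous row with the same key + 1), where 0 marks the end of a chain.
struct ChainedIndex {
    head: HashMap<u64, usize>,
    next: Vec<usize>,
}

impl ChainedIndex {
    fn build(keys: &[u64]) -> Self {
        let mut head = HashMap::new();
        let mut next = vec![0; keys.len()];
        for (row, key) in keys.iter().enumerate() {
            // `insert` returns the previous chain head for this key, if any.
            if let Some(prev) = head.insert(*key, row + 1) {
                next[row] = prev;
            }
        }
        Self { head, next }
    }

    /// Build-side rows matching `key`, newest first (reverse insertion order).
    fn matches(&self, key: u64) -> Vec<usize> {
        let mut out = Vec::new();
        let mut cur = self.head.get(&key).copied().unwrap_or(0);
        while cur != 0 {
            out.push(cur - 1);
            cur = self.next[cur - 1];
        }
        out
    }
}

/// Full-batch strategy: probe in reverse, then reverse the whole result,
/// so both sides come out in natural order.
fn match_reversed(index: &ChainedIndex, probe: &[u64]) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for (p, key) in probe.iter().enumerate().rev() {
        for b in index.matches(*key) {
            pairs.push((b, p));
        }
    }
    pairs.reverse();
    pairs
}

/// Partial-output strategy: probe in natural order with no final reversal,
/// so probe order is kept but build rows stay in reverse order.
fn match_natural(index: &ChainedIndex, probe: &[u64]) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for (p, key) in probe.iter().enumerate() {
        for b in index.matches(*key) {
            pairs.push((b, p));
        }
    }
    pairs
}
```

With build keys `[7, 7, 7]` and probe keys `[7, 7]`, `match_reversed` yields `(build, probe)` pairs with both indices ascending, while `match_natural` keeps the probe indices ascending but lists the build indices as 2, 1, 0 for each probe row. That is the same inversion seen for t1 in the updated test output.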
