Re: [PR] WIP: feat: emitting partial join results in `HashJoinStream` [arrow-datafusion]

via GitHub Fri, 03 Nov 2023 12:24:39 -0700


korowa commented on code in PR #8020:
URL: https://github.com/apache/arrow-datafusion/pull/8020#discussion_r1382087972



##########
datafusion/sqllogictest/test_files/join_disable_repartition_joins.slt:
##########
@@ -72,11 +72,11 @@ SELECT t1.a, t1.b, t1.c, t2.a as a2
  ON t1.d = t2.d ORDER BY a2, t2.b
  LIMIT 5
 ----
-0 0 0 0
-0 0 2 0
-0 0 3 0
-0 0 6 0
-0 0 20 0
+1 3 95 0

Review Comment:
   Well, yes -- this query result is ordered only by t2 columns, so the order 
of t1 rows is unspecified. The current behaviour of the index-matching function is to 
[iterate](https://github.com/apache/arrow-datafusion/blob/c2e768052c43e4bab6705ee76befc19de383c2cb/datafusion/physical-plan/src/joins/hash_join.rs#L881)
 over the probe-side indices in reverse and attach build-side indices to them 
(the HashMap + Vector data structure also emits build-side indices in reverse 
order). After the whole probe side has been matched, the resulting arrays are 
[reversed](https://github.com/apache/arrow-datafusion/blob/c2e768052c43e4bab6705ee76befc19de383c2cb/datafusion/physical-plan/src/joins/hash_join.rs#L915-L916)
 again, which returns both left- and right-side indices in their natural order.
   
   In the case of partial output, I can't see any option other than iterating 
over the probe side in natural order (otherwise the record order would be broken, 
as there is no "full batch" anymore to re-sort). At the same time, the build side 
is still stored in the same data structure, which yields indices in reverse order.
   
   So, it's a side effect: hash join still maintains the probe-side input order, 
but no longer the build-side order (I guess it could potentially be restored by 
tweaking the `collect_build_side` function). That's why the t1 order in this query 
result is inverted now.
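
   To illustrate the effect described above, here is a minimal sketch. It is 
not the actual DataFusion code: `ChainedMap`, `build`, `push_matches`, 
`match_full_batch` and `match_forward` are hypothetical names, join keys are 
plain `i32` values instead of hashes, and a std `HashMap` stands in for the 
real hash table. It only shows why reverse iteration plus a final reversal 
gives natural order on both sides, while forward iteration keeps probe-side 
order but inverts build-side order within each probe row.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the build-side "HashMap + Vector" structure.
/// `head` maps a join key to the 1-based index of the LAST build row with
/// that key; `next[i]` is the 1-based index of the previous build row with
/// the same key (0 marks the end of the chain).
struct ChainedMap {
    head: HashMap<i32, usize>,
    next: Vec<usize>,
}

fn build(keys: &[i32]) -> ChainedMap {
    let mut head = HashMap::new();
    let mut next = vec![0; keys.len()];
    for (row, key) in keys.iter().enumerate() {
        // `insert` returns the previous chain head for this key, if any.
        if let Some(prev) = head.insert(*key, row + 1) {
            next[row] = prev;
        }
    }
    ChainedMap { head, next }
}

/// Appends (build_row, probe_row) pairs for one probe row.
/// Walking the chain yields build rows newest-first, i.e. in reverse order.
fn push_matches(map: &ChainedMap, key: i32, probe_row: usize, out: &mut Vec<(usize, usize)>) {
    let mut cur = map.head.get(&key).copied().unwrap_or(0);
    while cur != 0 {
        out.push((cur - 1, probe_row));
        cur = map.next[cur - 1];
    }
}

/// Full-batch approach: iterate the probe side in reverse, then reverse the
/// whole result once, restoring natural order on both sides.
fn match_full_batch(map: &ChainedMap, probe: &[i32]) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    for (row, key) in probe.iter().enumerate().rev() {
        push_matches(map, *key, row, &mut out);
    }
    out.reverse();
    out
}

/// Forward approach: pairs are produced in their final order, so any prefix
/// could be emitted as a partial batch, but build rows stay newest-first.
fn match_forward(map: &ChainedMap, probe: &[i32]) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    for (row, key) in probe.iter().enumerate() {
        push_matches(map, *key, row, &mut out);
    }
    out
}
```

   With build keys `[7, 7, 8]` and probe keys `[7, 8]`, `match_full_batch` 
returns `[(0, 0), (1, 0), (2, 1)]`, while `match_forward` returns 
`[(1, 0), (0, 0), (2, 1)]`: probe rows are still ascending, but the two build 
rows matching the first probe row come out swapped.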



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
