comphead commented on issue #12454:
URL: https://github.com/apache/datafusion/issues/12454#issuecomment-2356313346

   @thinkharderdev that's totally true: outer joins, and especially filtered 
outer joins, require tracking the filtered/matched indexes before emitting the 
final result. We encountered a similar problem in SortMergeJoin 
https://github.com/apache/datafusion/issues/12359, even in a single-node 
environment, but the idea is the same. Each partition is processed with no 
knowledge of the other partitions, so a row can find a match in partition0 but 
no match in partition1. In that case the join result is emitted incorrectly. 
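
   To make the failure mode concrete, here is a minimal Python sketch (not 
DataFusion code). It assumes the left side is visible to every partition while 
the right side is split arbitrarily; the `left_outer_join` helper and the 
sample rows are hypothetical.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, str]                      # (join key, payload)
Joined = Tuple[int, str, Optional[str]]    # (key, left payload, right payload or None)


def left_outer_join(left: List[Row], right: List[Row]) -> List[Joined]:
    """Nested-loop left outer join on the first tuple element (illustrative only)."""
    out: List[Joined] = []
    for lk, lv in left:
        matches = [rv for rk, rv in right if rk == lk]
        if matches:
            out.extend((lk, lv, rv) for rv in matches)
        else:
            # No match in THIS input, so a null-padded row is emitted.
            out.append((lk, lv, None))
    return out


left = [(1, "a"), (2, "b")]
# Hypothetical arbitrary split of the right side: key 1 lives only in partition 0.
right_partitions = [[(1, "x")], [(3, "y")]]

# Each partition joins independently and knows nothing about the others.
naive: List[Joined] = []
for part in right_partitions:
    naive.extend(left_outer_join(left, part))

# Reference result: join against the whole right side at once.
all_right = [row for part in right_partitions for row in part]
expected = left_outer_join(left, all_right)
```

   Partition 1 emits a null-padded row for key 1 even though partition 0 
matched it, and the unmatched row for key 2 is emitted once per partition.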
   
   There are probably 2 options to handle it:
   - Final join stage, which requires sending data to a final join stage. To 
avoid sending all the data to a single node, you have to partition it so that 
the same key always lands on the same partition. This is skew-prone, of course.
   - Copy the small table to every node, but again this requires proper 
partitioning to keep the same key within the same partition so you can 
track matches correctly.
   
   Both approaches require partitioning the keys appropriately.
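
   As a rough illustration of why key-based partitioning helps, the Python 
sketch below (not DataFusion code) hash-partitions both sides on the join key, 
so that a key's match status can be decided entirely inside one partition. The 
helpers `hash_partition` and `left_outer_join`, the modulo hash, and the sample 
rows are all assumptions made for the example.

```python
from collections import Counter
from typing import List, Optional, Tuple

Row = Tuple[int, str]                      # (join key, payload)
Joined = Tuple[int, str, Optional[str]]    # (key, left payload, right payload or None)


def hash_partition(rows: List[Row], n: int) -> List[List[Row]]:
    """Route each row to partition `key % n`, so equal keys are co-located."""
    parts: List[List[Row]] = [[] for _ in range(n)]
    for row in rows:
        parts[row[0] % n].append(row)
    return parts


def left_outer_join(left: List[Row], right: List[Row]) -> List[Joined]:
    """Nested-loop left outer join on the first tuple element (illustrative only)."""
    out: List[Joined] = []
    for lk, lv in left:
        matches = [rv for rk, rv in right if rk == lk]
        if matches:
            out.extend((lk, lv, rv) for rv in matches)
        else:
            out.append((lk, lv, None))
    return out


left = [(1, "a"), (2, "b"), (4, "d")]
right = [(1, "x"), (3, "y"), (1, "z")]
n = 2

# Both sides use the same partitioning function on the join key.
left_parts = hash_partition(left, n)
right_parts = hash_partition(right, n)

# Each partition pair is joined independently; no cross-partition state is needed.
partitioned: List[Joined] = []
for lp, rp in zip(left_parts, right_parts):
    partitioned.extend(left_outer_join(lp, rp))

# Reference result computed without any partitioning.
global_result = left_outer_join(left, right)

# Row order differs between the two, so compare as multisets.
same_rows = Counter(partitioned) == Counter(global_result)
```

   The skew concern is visible here too: with only even keys, every row would 
land in partition 0 and the other partition would sit idle.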


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

