[GitHub] [arrow-datafusion] Dandandan edited a comment on issue #235: Failing tests in master: left_join_using and left_join

GitBox Sun, 02 May 2021 04:20:57 -0700


Dandandan edited a comment on issue #235:
URL: 
https://github.com/apache/arrow-datafusion/issues/235#issuecomment-830792427



   Thanks @jorgecarleitao
   
   I added an implementation of left join where unmatched left rows are 
produced at the end of a stream.
   I'm not totally sure what you mean; I think we still have to keep track of 
rows that didn't match any row at the left side.
   For inner joins, or on the right/left side of a left/right join 
respectively, we could indeed add a null filter on the columns, but this 
would be more of an optimization to push down null filters as far as possible. 
I think this is something Spark does too.
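
To make the bookkeeping concrete, here is a minimal sketch of the idea, not the code in the PR: a toy hash left join over integer keys, where `left_join`, its tuple row layout, and the use of `Option` for NULL are all hypothetical simplifications. It assumes the left side is fully collected and the right side arrives as a stream of batches; one `Vec<bool>` flag per left row is shared across all right batches, and unmatched left rows are only emitted once the stream ends.

```rust
use std::collections::HashMap;

/// Hypothetical toy left join (illustration only, not DataFusion's implementation).
/// `left` is the collected build side; `right_batches` is the streamed probe side.
pub fn left_join<'a>(
    left: &[(u32, &'a str)],
    right_batches: &[Vec<(u32, &'a str)>],
) -> Vec<(&'a str, Option<&'a str>)> {
    // Build side: join key -> indices of the left rows holding that key.
    let mut index: HashMap<u32, Vec<usize>> = HashMap::new();
    for (i, (key, _)) in left.iter().enumerate() {
        index.entry(*key).or_default().push(i);
    }

    // One flag per left row, kept alive across all probe batches.
    let mut matched = vec![false; left.len()];
    let mut out = Vec::new();

    for batch in right_batches {
        for (key, right_val) in batch {
            if let Some(rows) = index.get(key) {
                for &i in rows {
                    matched[i] = true;
                    out.push((left[i].1, Some(*right_val)));
                }
            }
        }
    }

    // End of stream: emit every left row that never matched, padded with NULL.
    for (i, seen) in matched.iter().enumerate() {
        if !*seen {
            out.push((left[i].1, None));
        }
    }
    out
}
```

The point of the sketch is that the flags cannot be decided per batch: a left row is only known to be unmatched after the last right batch has been probed.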
   
   I think there might be some possible improvements in the current 
implementation:
   
   * Use a bitmap structure instead of `Vec<bool>`. Efficiency-wise, the 
current PR should already be a large improvement though (don't have any 
benchmarks to prove it ATM, but a new hashset for each batch seems like it will 
be quite slow).
   * Generate the unmatched rows in batches with the configured batch size. 
Currently, it generates them in "one go".
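
Both improvements above can be sketched together. This is only an illustration under stated assumptions: `Bitmap` and `unmatched_batches` are hypothetical names, the bitmap is hand-rolled over `u64` words rather than taken from Arrow, and the "batches" are plain vectors of row indices rather than record batches.

```rust
/// Hypothetical fixed-size bitmap: one bit per left row, packed into u64 words.
pub struct Bitmap {
    words: Vec<u64>,
    len: usize,
}

impl Bitmap {
    /// Creates a bitmap of `len` bits, all unset.
    pub fn new(len: usize) -> Self {
        Bitmap { words: vec![0; (len + 63) / 64], len }
    }

    /// Marks row `i` as matched.
    pub fn set(&mut self, i: usize) {
        assert!(i < self.len, "row index out of bounds");
        self.words[i / 64] |= 1u64 << (i % 64);
    }

    /// Returns whether row `i` has been marked; out-of-range indices read as unset.
    pub fn get(&self, i: usize) -> bool {
        i < self.len && (self.words[i / 64] >> (i % 64)) & 1 == 1
    }
}

/// Collects the indices of unmatched rows and splits them into chunks of
/// at most `batch_size` indices each, instead of one large output.
pub fn unmatched_batches(visited: &Bitmap, batch_size: usize) -> Vec<Vec<usize>> {
    assert!(batch_size > 0, "batch_size must be positive");
    let unmatched: Vec<usize> = (0..visited.len).filter(|&i| !visited.get(i)).collect();
    unmatched.chunks(batch_size).map(|c| c.to_vec()).collect()
}
```

Compared with `Vec<bool>`, the packed form uses one bit instead of one byte per row. The sketch still materializes all unmatched indices before chunking; a streaming version would yield each chunk lazily.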
   
   @andygrove this also seems to fix the tests in this issue, would be nice if 
you could confirm this on the
   
   More generally, maybe we should run the sql tests in some different settings 
(concurrency / optimizations, etc) to do some more exhaustive checking using 
all of the different configurations / environmental changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
