Re: [I] Potentially improve join performance by implementing a version of the take kernel that accepts an iterator of indices [datafusion]

via GitHub Tue, 03 Dec 2024 02:58:51 -0800


Dandandan commented on issue #13620:
URL: https://github.com/apache/datafusion/issues/13620#issuecomment-2514220770


   AFAIK the nested loop join (NLJ) algorithm is as follows:
   
   1. Generating left / right indices in a certain pattern
   2. Taking the left / right values at those indices for the columns used in the filter
   3. Evaluating the filter expression to create a filter (boolean mask)
   4. Filtering the indices with that filter
   5. Taking the values of the output columns based on the filtered indices
   
   Optionally, some extra logic is applied per join type (e.g. updating visited rows).
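
   The five steps above can be sketched roughly as follows. This is only an illustration on plain slices, not the DataFusion implementation: `build_indices`, `take` and `nested_loop_join` are hypothetical helpers, the filter `left < right` is an assumed example expression, and only inner-join semantics are shown (no visited-row tracking).

```rust
/// Step 1: generate the index pattern (here: every left row paired with every right row).
fn build_indices(left_len: usize, right_len: usize) -> (Vec<usize>, Vec<usize>) {
    let mut left_idx = Vec::with_capacity(left_len * right_len);
    let mut right_idx = Vec::with_capacity(left_len * right_len);
    for l in 0..left_len {
        for r in 0..right_len {
            left_idx.push(l);
            right_idx.push(r);
        }
    }
    (left_idx, right_idx)
}

/// Plain-slice stand-in for a take kernel (hypothetical, not the arrow API).
fn take<T: Copy>(values: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| values[i]).collect()
}

/// Steps 1-5 for an inner join with the assumed filter `left < right`.
fn nested_loop_join(left: &[i64], right: &[i64]) -> Vec<(i64, i64)> {
    // Step 1: generate indices.
    let (left_idx, right_idx) = build_indices(left.len(), right.len());

    // Step 2: take the filter columns for both sides.
    let l_vals = take(left, &left_idx);
    let r_vals = take(right, &right_idx);

    // Step 3: evaluate the filter expression into a boolean mask.
    let mask: Vec<bool> = l_vals.iter().zip(&r_vals).map(|(l, r)| l < r).collect();

    // Step 4: filter the indices with the mask.
    let mut kept_left = Vec::new();
    let mut kept_right = Vec::new();
    for (pos, &keep) in mask.iter().enumerate() {
        if keep {
            kept_left.push(left_idx[pos]);
            kept_right.push(right_idx[pos]);
        }
    }

    // Step 5: take the output columns based on the filtered indices.
    let out_left = take(left, &kept_left);
    let out_right = take(right, &kept_right);
    out_left.into_iter().zip(out_right).collect()
}
```

   Note how this sketch materializes the full index vectors, the taken filter columns and the mask before producing any output, which is the overhead discussed below.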
   
   I think there are a couple of things that might be optimized in the nested loop 
join:
   
   * the values taken from the left side for the columns in the filter might be 
reused between iterations (as they might be the same every time)
   * ideally, we don't create the indices, take the values, create a filter and 
apply it, but create the output indices or even the output directly. This 
would entail writing a specific implementation which, instead of calling the 
different kernels, combines the individual operations, just like @Rachelint did 
for hash aggregates. Possibly a take kernel that accepts an iterator could help 
here, as it would avoid reimplementing this bit for each "fused" 
implementation. I think we need to see an example of this first.
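
   A minimal sketch of what such a "fused" variant might look like, again on plain slices and with the assumed filter `left < right`. `take_iter`, `fused_join_indices` and `fused_nested_loop_join` are hypothetical names for illustration; no such iterator-based take kernel is claimed to exist in arrow or DataFusion.

```rust
/// Hypothetical take variant that consumes indices lazily from an iterator,
/// so the caller never has to materialize an index array.
fn take_iter<T: Copy>(values: &[T], indices: impl Iterator<Item = usize>) -> Vec<T> {
    indices.map(|i| values[i]).collect()
}

/// Fused index generation + filter evaluation: yields only the matching
/// (left, right) index pairs, without intermediate index vectors or a mask.
fn fused_join_indices<'a>(
    left: &'a [i64],
    right: &'a [i64],
) -> impl Iterator<Item = (usize, usize)> + 'a {
    (0..left.len()).flat_map(move |l| {
        // The left value is read once and reused for every right row.
        let l_val = left[l];
        (0..right.len())
            .filter(move |&r| l_val < right[r])
            .map(move |r| (l, r))
    })
}

/// Builds both output columns by feeding the lazy index stream into `take_iter`.
/// In this sketch the filter is evaluated once per output column; a real
/// implementation would likely share a single pass over the index stream.
fn fused_nested_loop_join(left: &[i64], right: &[i64]) -> (Vec<i64>, Vec<i64>) {
    let out_left = take_iter(left, fused_join_indices(left, right).map(|(l, _)| l));
    let out_right = take_iter(right, fused_join_indices(left, right).map(|(_, r)| r));
    (out_left, out_right)
}
```

   The point of the iterator-based `take_iter` here is that each fused implementation only has to provide its own index stream, while the "take" part is written once.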


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


