alihan-synnada opened a new issue, #13620:
URL: https://github.com/apache/datafusion/issues/13620

   ### Is your feature request related to a problem or challenge?
   
   Collecting the filtered indices into a `PrimitiveArray` takes a lot of 
memory and time. An iterator-based design would avoid much of that 
intermediate memory and could potentially speed up the join operation through 
fewer cache misses and less copying.
   
   ### Describe the solution you'd like
   
   Create a version of arrow's `compute::take` kernel that accepts an iterator 
of indices. Benchmark it to find out where iterators are worth using over 
`PrimitiveArray`s.
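
   To illustrate the idea only, here is a minimal sketch of an iterator-driven 
take over a plain slice. It uses only the Rust standard library and is not 
arrow's `compute::take` API; the name `take_with_iter` is borrowed from the 
POC mentioned below, but the signature, the use of `Vec`, and the lack of null 
handling are assumptions made for this example.

```rust
/// Hypothetical stand-in for an iterator-based take kernel.
/// Gathers `values[i]` for every index yielded by `indices`, without first
/// materializing the indices into an intermediate array.
/// Panics if an index is out of bounds (a real kernel would likely return an error).
fn take_with_iter<T: Copy>(
    values: &[T],
    indices: impl IntoIterator<Item = usize>,
) -> Vec<T> {
    let iter = indices.into_iter();
    // Use the iterator's lower size bound as an allocation hint.
    let (lower, _) = iter.size_hint();
    let mut out = Vec::with_capacity(lower);
    for i in iter {
        out.push(values[i]);
    }
    out
}
```

   The indices can then come straight from a lazy adapter chain, for example 
`take_with_iter(&values, (0..values.len()).filter(|i| i % 2 == 1))`, so no 
index array is ever allocated.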
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   I have tried to create a POC, but I get different results every time I 
benchmark it and I haven't figured out what's wrong. I also had a lot of 
trouble taking ownership of the values array of `PrimitiveArray`s that are 
guaranteed to have no nulls (namely the indices cache and the mask produced by 
the filter).
   
   Furthermore, the mask's `.values().set_indices()` iterator can be used to 
generate the indices for the join, because each left-right index pair is a 
function of the set bit's index in the mask, in the form of 
`(index % left_batch.num_rows(), index / left_batch.num_rows())`, and it 
vectorizes nicely.
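
   As a rough sketch of that mapping, assuming the mask is laid out so that 
`index % left_rows` is the left row and `index / left_rows` is the right row, 
as in the formula above. A `&[bool]` slice stands in for the arrow boolean 
buffer, and both function names are hypothetical helpers written for this 
example, not arrow or DataFusion APIs.

```rust
/// Stand-in for iterating the set bits of a boolean mask:
/// yields the positions whose value is `true`.
fn set_indices(mask: &[bool]) -> impl Iterator<Item = usize> + '_ {
    mask.iter()
        .enumerate()
        .filter_map(|(i, &bit)| if bit { Some(i) } else { None })
}

/// Lazily maps each set position of the mask to a (left, right) row pair,
/// using the formula from the issue text.
fn join_index_pairs(
    mask: &[bool],
    left_rows: usize,
) -> impl Iterator<Item = (usize, usize)> + '_ {
    assert!(left_rows > 0, "left batch must have at least one row");
    set_indices(mask).map(move |i| (i % left_rows, i / left_rows))
}
```

   With `left_rows = 3` and set bits at positions 0, 3 and 5, this yields the 
pairs `(0, 0)`, `(0, 1)` and `(2, 1)` without building an index array first.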
   
   I plan to share the entirety of my POC ([here's a potential implementation 
of take_with_iter](https://gist.github.com/alihan-synnada/4c327b873ea4152cfde0e2816a59bc94)) 
and benchmarks once I can tame them and generate the same results 
deterministically.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

