Re: [PR] POC: Vectorized hashtable for aggregation [datafusion]

via GitHub Mon, 28 Oct 2024 01:58:29 -0700


Rachelint commented on code in PR #12996:
URL: https://github.com/apache/datafusion/pull/12996#discussion_r1818601807



##########
datafusion/physical-plan/src/aggregates/group_values/group_column.rs:
##########
@@ -287,6 +469,63 @@ where
         };
     }
 
+    fn vectorized_equal_to(

Review Comment:
   > At least `take` for `StringArray` is expensive as it needs to copy the 
string data, recheck the offsets, etc. For string / byte views this will 
probably be much better. I think some approach for using `take` + `eq` for 
primitives while using a custom one for string arrays will likely be best 
overal.
   > 
   > Even better would be to write a optimized implementation (an arrow kernel 
that performs equality of two arrays by indices efficiently while applying the 
same tricks done by take / eq kernels). The join implementation can benefit 
from this as well.
   
   Make sense for me, and support it in `arrow` seems really good.
   
   I plan to sort out codes in this pr firstly, and then try to improve the 
primitive case using `take` + `eq`.
   I guess some case like `q32` in clickbench is possible to get benefit from 
it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] POC: Vectorized hashtable for aggregation [datafusion]

Reply via email to