rluvaton opened a new issue, #6996:
URL: https://github.com/apache/arrow-rs/issues/6996

   I just read the 
[Photon](https://cs.stanford.edu/~matei/papers/2022/sigmod_photon.pdf) paper 
from 2022 and saw its vectorized hash table implementation. I also 
noticed that someone opened an issue in DataFusion 
(https://github.com/apache/datafusion/issues/7095) to implement it for group 
aggregation.
   
   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   I would like to have a `HashSet`/`HashMap` that supports all hash table 
functionality but takes arrays as input.
   
   Problem:
   DataFusion has `array_agg` with distinct support. If you look at the 
implementation, it just keeps adding to a `HashSet`:
   
https://github.com/apache/datafusion/blob/6c9355d5be8b6045865fed67cb6d028b2dfc2e06/datafusion/functions-aggregate/src/array_agg.rs#L268-L281
   
   This works, but it can be improved by computing all the hashes first and 
then probing in a tight loop.
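
A minimal sketch of that two-phase idea, using only the standard library. Everything here is hypothetical and for illustration: `BatchSet`, `insert_batch`, and the linear-probing layout are not existing arrow-rs or DataFusion APIs, and a plain `&[i64]` slice stands in for a primitive Arrow array.

```rust
use std::collections::hash_map::RandomState;
use std::hash::BuildHasher;

/// Hypothetical open-addressing set of `i64` keys (linear probing).
/// The slot count is always a power of two and at least 2x the element count.
struct BatchSet {
    slots: Vec<Option<i64>>,
    len: usize,
    state: RandomState,
}

impl BatchSet {
    fn new() -> Self {
        Self { slots: vec![None; 8], len: 0, state: RandomState::new() }
    }

    /// Probe with a precomputed hash. Returns true if `value` was newly inserted.
    fn insert_hashed(slots: &mut [Option<i64>], value: i64, hash: u64) -> bool {
        let mask = slots.len() - 1;
        let mut i = hash as usize & mask;
        loop {
            match slots[i] {
                None => {
                    slots[i] = Some(value);
                    return true;
                }
                Some(existing) if existing == value => return false,
                Some(_) => i = (i + 1) & mask,
            }
        }
    }

    /// Grow (and rehash) so that `additional` more keys fit at <= 50% load.
    fn reserve(&mut self, additional: usize) {
        let needed = (self.len + additional) * 2;
        if needed <= self.slots.len() {
            return;
        }
        let mut new_slots = vec![None; needed.next_power_of_two()];
        for v in self.slots.iter().flatten() {
            Self::insert_hashed(&mut new_slots, *v, self.state.hash_one(v));
        }
        self.slots = new_slots;
    }

    /// Insert a whole batch: hash everything first, then probe.
    fn insert_batch(&mut self, values: &[i64]) {
        // Phase 1: compute all hashes up front in one pass over the input.
        let hashes: Vec<u64> = values.iter().map(|v| self.state.hash_one(v)).collect();
        // Reserve once so the probing loop never has to resize.
        self.reserve(values.len());
        // Phase 2: tight probing loop that only touches the slot array.
        for (&v, &h) in values.iter().zip(&hashes) {
            if Self::insert_hashed(&mut self.slots, v, h) {
                self.len += 1;
            }
        }
    }
}
```

For example, calling `insert_batch(&[1, 2, 2, 3, 1])` on a fresh `BatchSet` leaves three distinct keys in the set. Separating hashing from probing keeps each loop small and branch-light, which is the property the per-value `HashSet::insert` approach does not have.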
   
   **Describe the solution you'd like**
   It would be helpful to use, there and in other places, a `HashSet`/`HashMap` 
that can:
   1. Insert all values from an array
   2. Check whether each value in an array exists and return a `BooleanArray` 
with the results
   3. Get all the values that match each key in the array
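
The three operations above could look roughly like the following. This is only a sketch of the API shape under stated assumptions: `BatchMultiMap` and its method names are hypothetical, slices and `Vec<bool>` stand in for Arrow arrays and `BooleanArray`, and it delegates to `std::collections::HashMap` rather than a vectorized table.

```rust
use std::collections::HashMap;

/// Hypothetical batch-oriented multimap; not an existing arrow-rs type.
struct BatchMultiMap {
    inner: HashMap<i64, Vec<u32>>,
}

impl BatchMultiMap {
    fn new() -> Self {
        Self { inner: HashMap::new() }
    }

    /// 1. Insert all key/value pairs from two equal-length "arrays".
    fn insert_batch(&mut self, keys: &[i64], values: &[u32]) {
        assert_eq!(keys.len(), values.len());
        for (k, v) in keys.iter().zip(values) {
            self.inner.entry(*k).or_default().push(*v);
        }
    }

    /// 2. One boolean per probe key (stand-in for a `BooleanArray`).
    fn contains_batch(&self, keys: &[i64]) -> Vec<bool> {
        keys.iter().map(|k| self.inner.contains_key(k)).collect()
    }

    /// 3. All stored values matching each probe key (empty if absent).
    fn get_batch(&self, keys: &[i64]) -> Vec<&[u32]> {
        keys.iter()
            .map(|k| self.inner.get(k).map(|v| v.as_slice()).unwrap_or(&[]))
            .collect()
    }
}
```

A real implementation would presumably return Arrow arrays (for instance a list-like array for the third operation) instead of `Vec`s, but that choice is left open here.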
   
   **Describe alternatives you've considered**
   Implement it in every place that needs a `HashMap`/`HashSet`, or create an 
external crate.
   
   **Additional context**
   The way I see it, there will be a few implementations:
   1. Primitive/boolean
   2. Bytes
   3. Generic that will use `arrow-row`
   
   
   I'm willing to create a PR for this. I see it using the hashbrown raw API 
internally.


