alamb commented on issue #6996: URL: https://github.com/apache/arrow-rs/issues/6996#issuecomment-2607111776
I think it is a very good observation that there are several structures in DataFusion that basically look like hash sets / hash maps on arrays.

```rust
let mut hash_table = HashMap::new(key_array.data_type(), value_array.data_type());
// returns a BooleanArray
let inserted = hash_table.insert(key_array, value_array);
```

And there are similar things for Set.

I am thinking specifically about code like https://github.com/apache/datafusion/blob/274e5356ceb4c559ab4105478e75817a302d2f13/datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs#L44-L51 and https://github.com/apache/datafusion/blob/274e5356ceb4c559ab4105478e75817a302d2f13/datafusion/functions-aggregate-common/src/aggregate/count_distinct/bytes.rs#L30-L46

In my opinion such a primitive makes a lot of sense to consider moving upstream into arrow. The open question in my mind is exactly what the API would look like.

Perhaps, to begin with, we could refactor the various maps / sets in DataFusion into an `ArrowSet` / `ArrowMap`. Once we have figured out what the API is, that would be an excellent time to consider proposing upstreaming it into arrow.
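
To make the API question a bit more concrete, here is a minimal sketch of what a batch-oriented set could look like. It is only an illustration under several assumptions. It uses the standard library alone, with `&[Option<i64>]` standing in for a nullable Arrow array and `Vec<bool>` standing in for the `BooleanArray`. The names `SketchSet` and `insert_batch` are hypothetical and are not an existing arrow-rs or DataFusion API. It also assumes the returned flags mean "this row added a new value", and that null is tracked as a single distinct entry, which is an arbitrary choice.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for an array-oriented set such as the proposed `ArrowSet`.
/// A real version would accept Arrow arrays; a slice of `Option<i64>` plays the
/// role of a nullable Int64 array here.
#[derive(Debug, Default)]
pub struct SketchSet {
    values: HashSet<i64>,
    seen_null: bool,
}

impl SketchSet {
    pub fn new() -> Self {
        Self::default()
    }

    /// Inserts every element of `keys` and returns one flag per input row:
    /// `true` if the row added a value that was not yet in the set,
    /// `false` if the value was already present.
    /// The returned `Vec<bool>` stands in for a `BooleanArray`.
    pub fn insert_batch(&mut self, keys: &[Option<i64>]) -> Vec<bool> {
        keys.iter()
            .map(|key| match key {
                // `HashSet::insert` returns true only for newly added values.
                Some(v) => self.values.insert(*v),
                // Assumption: null counts as one distinct entry.
                None => {
                    let first = !self.seen_null;
                    self.seen_null = true;
                    first
                }
            })
            .collect()
    }

    /// Number of distinct entries, counting null once if it was seen.
    pub fn len(&self) -> usize {
        self.values.len() + usize::from(self.seen_null)
    }
}
```

Calling `insert_batch` once per incoming batch keeps the per-row work inside the set, which is roughly the shape the distinct-count accumulators linked above seem to need. A map variant would presumably take a second array of values alongside the keys.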