[I] Make `GenericDistinctBuffer` generic over both `Hashable` and native types [datafusion]

via GitHub Thu, 13 Nov 2025 02:55:39 -0800


Jefffrey opened a new issue, #18670:
URL: https://github.com/apache/datafusion/issues/18670


   ### Is your feature request related to a problem or challenge?
   
   `GenericDistinctBuffer` was introduced by #18348 to make it easier to write 
distinct accumulators. Currently it uses `Hashable` to represent the items 
being stored:
   
   
https://github.com/apache/datafusion/blob/e42a0b626aa3b6b2e5e8b297c432e0c982706e8e/datafusion/functions-aggregate-common/src/utils.rs#L185-L188
   
   
https://github.com/apache/datafusion/blob/e42a0b626aa3b6b2e5e8b297c432e0c982706e8e/datafusion/functions-aggregate-common/src/utils.rs#L73-L81
   
   `Hashable` is mainly required because `f32`/`f64` don't implement `Hash` 
(or `Eq`), so this wrapper lets us hash them along with all other native types 
(e.g. `i32`, `u64`). However, I wonder if there is a way to use `Hashable` only 
for floats, whilst using the native `Hash` implementation for the other native 
types (a potential performance benefit?).
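
   To make the float problem concrete, here is a minimal sketch of a wrapper 
that hashes and compares an `f64` by its bit pattern. `BitsKey` and the two 
helper functions are hypothetical names used only for illustration; this is not 
the `Hashable` definition linked above, just the general idea.

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical wrapper (NOT DataFusion's `Hashable`): compares and hashes
/// an `f64` by its raw bit pattern.
#[derive(Clone, Copy, Debug)]
struct BitsKey(f64);

impl PartialEq for BitsKey {
    fn eq(&self, other: &Self) -> bool {
        self.0.to_bits() == other.0.to_bits()
    }
}

// Bitwise equality is reflexive (even for NaN), so `Eq` is sound here.
impl Eq for BitsKey {}

impl Hash for BitsKey {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.to_bits().hash(state);
    }
}

/// Counts distinct floats. A plain `HashSet<f64>` would not compile,
/// because `f64` implements neither `Eq` nor `Hash`.
fn count_distinct_f64(values: &[f64]) -> usize {
    values.iter().map(|v| BitsKey(*v)).collect::<HashSet<_>>().len()
}

/// Integers need no wrapper: `i32` is natively `Hash + Eq`.
fn count_distinct_i32(values: &[i32]) -> usize {
    values.iter().copied().collect::<HashSet<_>>().len()
}
```

   Note that bitwise comparison treats `0.0` and `-0.0` as different values and 
identical NaN bit patterns as equal; whether that matches the semantics wanted 
for a distinct aggregate is a separate decision.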
   
   ### Describe the solution you'd like
   
   Be able to use `GenericDistinctBuffer` generically over both `f32` and `i32` 
types without forcing both through the `Hashable` interface. This could be 
done with duplicated versions, one that takes `Hashable` and one that 
takes natively hashable types, but ideally we'd have a single solution.
   
   - If we had specialization this would be much easier 😅 
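
   One possible direction that works on stable Rust without specialization is 
a trait with an associated key type: each native type declares what it is 
stored as, so integers store themselves and floats store their bits. The sketch 
below is only an assumption about how this could look; `DistinctKey` and 
`DistinctBuffer` are hypothetical names, not existing DataFusion APIs, and the 
performance impact would still need benchmarking.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical trait: each native type names the key type it is stored as.
trait DistinctKey: Copy {
    type Key: Hash + Eq + Copy;
    fn to_key(self) -> Self::Key;
    fn from_key(key: Self::Key) -> Self;
}

// Natively hashable types are their own key, so they use the std `Hash` impl.
macro_rules! native_key {
    ($($t:ty),*) => {$(
        impl DistinctKey for $t {
            type Key = $t;
            fn to_key(self) -> $t { self }
            fn from_key(key: $t) -> $t { key }
        }
    )*};
}
native_key!(i8, i16, i32, i64, u8, u16, u32, u64);

// Floats are stored as their bit pattern, which is natively hashable.
impl DistinctKey for f32 {
    type Key = u32;
    fn to_key(self) -> u32 { self.to_bits() }
    fn from_key(key: u32) -> f32 { f32::from_bits(key) }
}

impl DistinctKey for f64 {
    type Key = u64;
    fn to_key(self) -> u64 { self.to_bits() }
    fn from_key(key: u64) -> f64 { f64::from_bits(key) }
}

/// Hypothetical single buffer type covering both integers and floats.
struct DistinctBuffer<T: DistinctKey> {
    values: HashSet<T::Key>,
}

impl<T: DistinctKey> DistinctBuffer<T> {
    fn new() -> Self {
        Self { values: HashSet::new() }
    }

    fn insert(&mut self, value: T) {
        self.values.insert(value.to_key());
    }

    fn len(&self) -> usize {
        self.values.len()
    }

    /// Converts the stored keys back into the native values.
    fn values(&self) -> Vec<T> {
        self.values.iter().map(|k| T::from_key(*k)).collect()
    }
}
```

   With this shape, `DistinctBuffer::<i32>` hashes plain `i32` values while 
`DistinctBuffer::<f64>` hashes `u64` bit patterns, and an accumulator could be 
written once against `T: DistinctKey`.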
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   If we could get this to work, we could fold 
`PrimitiveDistinctCountAccumulator` together with 
`FloatDistinctCountAccumulator`:
   
   
https://github.com/apache/datafusion/blob/e42a0b626aa3b6b2e5e8b297c432e0c982706e8e/datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs#L44-L51
   
   
https://github.com/apache/datafusion/blob/e42a0b626aa3b6b2e5e8b297c432e0c982706e8e/datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs#L126-L129
   
   When making these changes we'd also need to benchmark, as there are likely 
performance implications: I initially tried forcing 
`PrimitiveDistinctCountAccumulator` to use `Hashable`, but I kept hitting 
regressions on microbenchmarks with `count(distinct)` queries, so we need to 
keep that in mind.



