ShashidharM0118 opened a new pull request, #18763:
URL: https://github.com/apache/datafusion/pull/18763

   ## Which issue does this PR close?
   
   - Closes #18670
   
   ## Rationale for this change
   
   As described in #18670, `GenericDistinctBuffer` previously forced all types 
through the `Hashable<T>` wrapper, even for natively hashable types like 
integers. This was necessary to handle floats (which don't implement `Hash` 
natively), but it meant that types with efficient native `Hash` implementations 
(like `i32`, `u64`, etc.) were unnecessarily wrapped, potentially impacting 
performance.
   
   This PR introduces a trait-based solution that allows 
`GenericDistinctBuffer` to work generically with both native hashable types and 
types requiring the `Hashable` wrapper, without code duplication.
   
   ## What changes are included in this PR?
   
   1. **Introduced `DistinctStorage` trait**: A new trait that abstracts over 
different storage strategies for distinct value tracking. It provides two 
implementations:
      - `HashSet<T, RandomState>` for natively hashable types (integers, 
decimals, dates, timestamps)
      - `HashSet<Hashable<T>, RandomState>` for types requiring special hashing 
(floats)
   
   2. **Made `GenericDistinctBuffer` generic over storage**: Changed from 
`GenericDistinctBuffer<T>` to `GenericDistinctBuffer<T, S>` where `S: 
DistinctStorage<Native = T::Native>`. This allows the buffer to use the 
appropriate storage strategy based on the type.
   
   3. **Unified distinct count accumulators**: Merged 
`PrimitiveDistinctCountAccumulator` and `FloatDistinctCountAccumulator` into a 
single generic implementation `PrimitiveDistinctCountAccumulator<T, S>`. The 
former `FloatDistinctCountAccumulator` is now a type alias using `Hashable` 
storage.
   
   4. **Updated all usages**: Modified all call sites in `count.rs`, 
`sum_distinct`, `median`, and `percentile_cont` to specify the appropriate 
storage type (native `HashSet<T>` for integers/dates/timestamps, 
`HashSet<Hashable<T>>` for floats).
   
   5. **Added `Send + Sync` bounds**: Added necessary trait bounds for proper 
thread-safe trait implementations in DataFusion's parallel execution 
environment.
   
   ## Are these changes tested?
   
   Yes
   
   ## Are there any user-facing changes?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to