Re: [PR] Refactor distinct aggregate implementations to use common buffer [datafusion]

via GitHub Wed, 29 Oct 2025 01:56:47 -0700


Jefffrey commented on code in PR #18348:
URL: https://github.com/apache/datafusion/pull/18348#discussion_r2472190872



##########
datafusion/functions-aggregate-common/src/utils.rs:
##########
@@ -167,3 +175,90 @@ impl<T: DecimalType> DecimalAverager<T> {
         }
     }
 }
+
+/// Generic way to collect distinct values for accumulators.
+///
+/// The intermediate state is represented as a List of scalar values updated by
+/// `merge_batch` and a `Vec` of `ArrayRef` that are converted to scalar values
+/// in the final evaluation step so that we avoid expensive conversions and
+/// allocations during `update_batch`.
+pub struct GenericDistinctBuffer<T: ArrowPrimitiveType> {

Review Comment:
   Main implementation here; I toyed with the idea of making this implement 
`Accumulator` and have the different functions (like `median` and 
`percentile_cont`) provide their evaluate logic as a closure but it got a bit 
messy; so for now they delegate their `state`/`update_batch`/`merge_batch` to 
this inner struct, which allows them to grab the final set of distinct `values` 
for them to do their own `evaluate`



##########
datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs:
##########


Review Comment:
   It would be nice if I can pull in `PrimitiveDistinctCountAccumulator` to the 
deduplication as well, however it is specialized for types which don't need to 
hash through `Hashable` (aka non-float types) and I think there might be a 
performance hit if I try force them to use `Hashable` 🤔 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Refactor distinct aggregate implementations to use common buffer [datafusion]

Reply via email to