Jefffrey commented on code in PR #18348:
URL: https://github.com/apache/datafusion/pull/18348#discussion_r2472190872
##########
datafusion/functions-aggregate-common/src/utils.rs:
##########
@@ -167,3 +175,90 @@ impl<T: DecimalType> DecimalAverager<T> {
}
}
}
+
+/// Generic way to collect distinct values for accumulators.
+///
+/// The intermediate state is represented as a List of scalar values updated by
+/// `merge_batch` and a `Vec` of `ArrayRef` that are converted to scalar values
+/// in the final evaluation step so that we avoid expensive conversions and
+/// allocations during `update_batch`.
+pub struct GenericDistinctBuffer<T: ArrowPrimitiveType> {
Review Comment:
Main implementation here; I toyed with the idea of making this implement
`Accumulator` and have the different functions (like `median` and
`percentile_cont`) provide their evaluate logic as a closure but it got a bit
messy; so for now they delegate their `state`/`update_batch`/`merge_batch` to
this inner struct, which allows them to grab the final set of distinct `values`
for them to do their own `evaluate`
##########
datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs:
##########
Review Comment:
It would be nice if I can pull in `PrimitiveDistinctCountAccumulator` to the
deduplication as well, however it is specialized for types which don't need to
hash through `Hashable` (aka non-float types) and I think there might be a
performance hit if I try force them to use `Hashable` 🤔
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]