tustvold commented on code in PR #7358:
URL: https://github.com/apache/arrow-datafusion/pull/7358#discussion_r1301853347
##########
datafusion/physical-expr/src/aggregate/sum_distinct.rs:
##########
@@ -106,29 +116,56 @@ impl PartialEq<dyn Any> for DistinctSum {
}
}
-#[derive(Debug)]
-struct DistinctSumAccumulator {
- hash_values: HashSet<ScalarValue, RandomState>,
+/// A wrapper around a type to provide hash for floats
+#[derive(Copy, Clone)]
+struct Hashable<T>(T);
+
+impl<T: ToByteSlice> std::hash::Hash for Hashable<T> {
+ fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
+ self.0.to_byte_slice().hash(state)
Review Comment:
This does use ahash, it overrides the BuildHasher used by the HashSet. I
think we could definitely do something better here, but in the absence of
benchmarks I'm keen to go with simple and we can always optimise it later down
the line. Regardless this should be significantly faster than the prior approach
Edit: If someone really care about the performance of DistinctSum,
implementing a GroupsAccumulator will likely yield far greater performance than
any incremental tweaking of this Accumulator-based version
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]