Re: [PR] Improve speed of `median` by implementing special `GroupsAccumulator` [datafusion]

via GitHub Tue, 28 Jan 2025 21:16:07 -0800


korowa commented on code in PR #13681:
URL: https://github.com/apache/datafusion/pull/13681#discussion_r1933264875



##########
datafusion/functions-aggregate/src/median.rs:
##########
@@ -230,6 +276,212 @@ impl<T: ArrowNumericType> Accumulator for MedianAccumulator<T> {
     }
 }
 
+/// The median groups accumulator accumulates the raw input values
+///
+/// For calculating the accurate medians of groups, we need to store all values
+/// of groups before final evaluation.
+/// So values in each group will be stored in a `Vec<T>`, and the total group values
+/// will be actually organized as a `Vec<Vec<T>>`.
+///
+#[derive(Debug)]
+struct MedianGroupsAccumulator<T: ArrowNumericType + Send> {
+    data_type: DataType,
+    group_values: Vec<Vec<T::Native>>,

Review Comment:
   There were some improvements, but the overall results for ClickBench q9 (the
query I was mostly looking at) were about x2.63 for the GroupsAccumulator and
x2.30 for the regular Accumulator, so roughly a 13-15% overall difference
(2.63 / 2.30 ≈ 1.14), which is not as massive as the results in this PR.
   
   However, maybe things have changed in the GroupsAccumulator implementation,
and now even a plain `Vec<HashSet<>>` will be way faster.
   
   UPD: and, yes, maybe producing the state, as pointed out by @alamb above, was
(at least partially) the cause of the non-significant improvement: in count
distinct it was implemented via `ListArray::from_iter_primitive`
([commit](https://github.com/apache/datafusion/pull/8721/commits/aa7199e1aab401c816e8089d4a4dab79e6e04855)),
instead of building it from a single flattened array and its offsets.
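
   To illustrate the "single flattened array and its offsets" layout, here is a
minimal sketch in plain Rust, using only the standard library. The helper names
`flatten_groups` and `group_slice` are hypothetical and are not part of
DataFusion or arrow-rs; the `i32` offsets are chosen to mirror the offset type
of Arrow's `ListArray`. In real code the two buffers would presumably be handed
to the Arrow list-array constructor instead of being returned as `Vec`s.

```rust
/// Flatten per-group values (`Vec<Vec<T>>`-style storage) into one contiguous
/// values buffer plus an offsets buffer of length `groups.len() + 1`.
/// Group `i` occupies `values[offsets[i]..offsets[i + 1]]`.
/// NOTE: hypothetical helper for illustration, not a DataFusion API.
fn flatten_groups<T: Copy>(groups: &[Vec<T>]) -> (Vec<T>, Vec<i32>) {
    // Pre-size both buffers so each is allocated exactly once.
    let total: usize = groups.iter().map(Vec::len).sum();
    let mut values = Vec::with_capacity(total);
    let mut offsets = Vec::with_capacity(groups.len() + 1);

    offsets.push(0i32);
    for group in groups {
        // Copy the whole group in one go instead of element by element.
        values.extend_from_slice(group);
        // The running length of `values` is the end offset of this group.
        let end = i32::try_from(values.len()).expect("offset overflows i32");
        offsets.push(end);
    }
    (values, offsets)
}

/// Read group `i` back out of the flattened representation.
/// NOTE: hypothetical helper for illustration.
fn group_slice<'a, T>(values: &'a [T], offsets: &[i32], i: usize) -> &'a [T] {
    &values[offsets[i] as usize..offsets[i + 1] as usize]
}
```

   For example, the groups `[[1, 2, 3], [], [4, 5]]` flatten to the values
`[1, 2, 3, 4, 5]` with offsets `[0, 3, 3, 5]`; an empty group simply repeats
the previous offset.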



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
