kumarUjjawal commented on code in PR #21575:
URL: https://github.com/apache/datafusion/pull/21575#discussion_r3071868071
##########
datafusion/functions-aggregate/benches/count_distinct.rs:
##########
@@ -87,6 +87,30 @@ fn create_i16_array(n_distinct: usize) -> Int16Array {
.collect()
}
+fn prepare_args(data_type: DataType) -> (Arc<Schema>,
AccumulatorArgs<'static>) {
Review Comment:
This helper uses Box::leak, and in the grouped path it is called inside
b.iter. When COUNT(DISTINCT) gets real group accumulator support, every
benchmark iteration will leak memory. That will affect the benchmark result
and make it less reliable. Please avoid leaking here, or build the args once
outside the timed loop.
##########
datafusion/functions-aggregate/benches/count_distinct.rs:
##########
@@ -150,5 +174,101 @@ fn count_distinct_benchmark(c: &mut Criterion) {
});
}
-criterion_group!(benches, count_distinct_benchmark);
+/// Create group indices with uniform distribution
+fn create_uniform_groups(num_groups: usize) -> Vec<usize> {
+ let mut rng = StdRng::seed_from_u64(42);
+ (0..BATCH_SIZE)
+ .map(|_| rng.random_range(0..num_groups))
+ .collect()
+}
+
+/// Create group indices with skewed distribution (80% in 20% of groups)
+fn create_skewed_groups(num_groups: usize) -> Vec<usize> {
+ let mut rng = StdRng::seed_from_u64(42);
+ let hot_groups = (num_groups / 5).max(1);
+ (0..BATCH_SIZE)
+ .map(|_| {
+ if rng.random_range(0..100) < 80 {
+ rng.random_range(0..hot_groups)
+ } else {
+ rng.random_range(0..num_groups)
+ }
+ })
+ .collect()
+}
+
+fn count_distinct_groups_benchmark(c: &mut Criterion) {
+ let count_fn = Count::new();
+
+ // bench different scenarios
+ let scenarios = [
+ // (name, num_groups, distinct_pct, group_fn)
+ ("sparse_uniform", 10, 80, "uniform"),
+ ("moderate_uniform", 100, 80, "uniform"),
+ ("dense_uniform", 1000, 80, "uniform"),
+ ("sparse_skewed", 10, 80, "skewed"),
+ ("dense_skewed", 1000, 80, "skewed"),
+ ("sparse_high_cardinality", 10, 99, "uniform"),
+ ("dense_low_cardinality", 1000, 20, "uniform"),
+ ];
+
+ for (name, num_groups, distinct_pct, group_type) in scenarios {
+ let n_distinct = BATCH_SIZE * distinct_pct / 100;
+ let values = Arc::new(create_i64_array(n_distinct)) as ArrayRef;
+ let group_indices = if group_type == "uniform" {
+ create_uniform_groups(num_groups)
+ } else {
+ create_skewed_groups(num_groups)
+ };
+
+ let (_schema, args) = prepare_args(DataType::Int64);
+
+ if count_fn.groups_accumulator_supported(args.clone()) {
Review Comment:
This benchmark name is a bit misleading right now. COUNT(DISTINCT) still
does not support groups_accumulator_supported, so this code always runs the
fallback path. The fallback creates a new one-row array for every input row, so
the benchmark mostly measures tiny allocations and per-row updates, not real
grouped distinct aggregation. Please either rename the benchmark to make that
clear, or change the fallback so it measures something closer to the real
grouped case.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]