kumarUjjawal commented on code in PR #21575:
URL: https://github.com/apache/datafusion/pull/21575#discussion_r3071868071


##########
datafusion/functions-aggregate/benches/count_distinct.rs:
##########
@@ -87,6 +87,30 @@ fn create_i16_array(n_distinct: usize) -> Int16Array {
         .collect()
 }
 
+fn prepare_args(data_type: DataType) -> (Arc<Schema>, 
AccumulatorArgs<'static>) {

Review Comment:
   This helper uses Box::leak, and in the grouped path it is called inside 
b.iter. When COUNT(DISTINCT) gets real group accumulator support, every 
benchmark iteration will leak  memory. That will affect the benchmark result 
and make it less reliable. Please avoid leaking here, or build the args once 
outside the timed loop.



##########
datafusion/functions-aggregate/benches/count_distinct.rs:
##########
@@ -150,5 +174,101 @@ fn count_distinct_benchmark(c: &mut Criterion) {
     });
 }
 
-criterion_group!(benches, count_distinct_benchmark);
+/// Create group indices with uniform distribution
+fn create_uniform_groups(num_groups: usize) -> Vec<usize> {
+    let mut rng = StdRng::seed_from_u64(42);
+    (0..BATCH_SIZE)
+        .map(|_| rng.random_range(0..num_groups))
+        .collect()
+}
+
+/// Create group indices with skewed distribution (80% in 20% of groups)
+fn create_skewed_groups(num_groups: usize) -> Vec<usize> {
+    let mut rng = StdRng::seed_from_u64(42);
+    let hot_groups = (num_groups / 5).max(1);
+    (0..BATCH_SIZE)
+        .map(|_| {
+            if rng.random_range(0..100) < 80 {
+                rng.random_range(0..hot_groups)
+            } else {
+                rng.random_range(0..num_groups)
+            }
+        })
+        .collect()
+}
+
+fn count_distinct_groups_benchmark(c: &mut Criterion) {
+    let count_fn = Count::new();
+
+    // bench different scenarios
+    let scenarios = [
+        // (name, num_groups, distinct_pct, group_fn)
+        ("sparse_uniform", 10, 80, "uniform"),
+        ("moderate_uniform", 100, 80, "uniform"),
+        ("dense_uniform", 1000, 80, "uniform"),
+        ("sparse_skewed", 10, 80, "skewed"),
+        ("dense_skewed", 1000, 80, "skewed"),
+        ("sparse_high_cardinality", 10, 99, "uniform"),
+        ("dense_low_cardinality", 1000, 20, "uniform"),
+    ];
+
+    for (name, num_groups, distinct_pct, group_type) in scenarios {
+        let n_distinct = BATCH_SIZE * distinct_pct / 100;
+        let values = Arc::new(create_i64_array(n_distinct)) as ArrayRef;
+        let group_indices = if group_type == "uniform" {
+            create_uniform_groups(num_groups)
+        } else {
+            create_skewed_groups(num_groups)
+        };
+
+        let (_schema, args) = prepare_args(DataType::Int64);
+
+        if count_fn.groups_accumulator_supported(args.clone()) {

Review Comment:
   This benchmark name is a bit misleading right now. COUNT(DISTINCT) still 
does not support groups_accumulator_supported, so this code always runs the 
fallback path. The fallback creates a new one-row array for every input row, so 
the benchmark mostly measures tiny allocations and per-row updates, not real 
grouped distinct aggregation. Please either rename the benchmark to make that 
clear, or change the fallback so it measures something closer to the real 
grouped case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to