zanmato1984 commented on code in PR #44053:
URL: https://github.com/apache/arrow/pull/44053#discussion_r1763342723
##########
cpp/src/arrow/acero/aggregate_benchmark.cc:
##########
@@ -866,5 +887,61 @@
BENCHMARK(TDigestKernelDoubleMedian)->Apply(QuantileKernelArgs);
BENCHMARK(TDigestKernelDoubleDeciles)->Apply(QuantileKernelArgs);
BENCHMARK(TDigestKernelDoubleCentiles)->Apply(QuantileKernelArgs);
+//
+// Segmented Aggregate
+//
+
+static void BenchmarkSegmentedAggregate(
+    benchmark::State& state, int64_t num_rows, std::vector<Aggregate> aggregates,
+    const std::vector<std::shared_ptr<Array>>& arguments,
+    const std::vector<std::shared_ptr<Array>>& keys, int64_t num_segment_keys,
+    int64_t num_segments) {
+  ASSERT_GT(num_segments, 0);
+
+  auto rng = random::RandomArrayGenerator(42);
+  auto segment_key = rng.Int64(num_rows, /*min=*/0, /*max=*/num_segments - 1);
+  int64_t* values = segment_key->data()->GetMutableValues<int64_t>(1);
+  std::sort(values, values + num_rows);
+  // num_segment_keys copies of the segment key.
+  ArrayVector segment_keys(num_segment_keys, segment_key);
+
+  BenchmarkAggregate(state, std::move(aggregates), arguments, keys, segment_keys);
+}
+
+template <typename... Args>
+static void CountScalarSegmentedByInts(benchmark::State& state, Args&&...) {
Review Comment:
Please let me explain a bit.
Though both are called "aggregation", most compute engines have two
variants of it, depending on whether there are "group by" keys (compare SQL
`select count(*) from t group by c` with `select count(*) from t`).
Acero calls them "[scalar
aggregation](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/scalar_aggregate_node.cc)"
and "[group by
aggregation](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/groupby_aggregate_node.cc)",
and they use the aggregation functions "count/sum/..." and
"hash_count/hash_sum/..." respectively. This split makes sense: without a
group by key, the aggregation only needs to hold one "scalar" value (e.g.,
the running count/sum/...) for the whole computation, whereas a group by key
immediately requires some structure like a hash table.
Segment keys, on the other hand, work orthogonally to group by keys: they
apply to both scalar and group by aggregations.
Back to your question: yes, "CountScalar" indicates exactly that this is a
"scalar aggregation" (i.e., without any group by keys) using the "count"
function, and "SegmentedByInts" indicates that there may be segment keys.
Hope this clears things up a bit.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to
go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]