zanmato1984 commented on code in PR #44053:
URL: https://github.com/apache/arrow/pull/44053#discussion_r1763342723


##########
cpp/src/arrow/acero/aggregate_benchmark.cc:
##########
@@ -866,5 +887,61 @@ 
BENCHMARK(TDigestKernelDoubleMedian)->Apply(QuantileKernelArgs);
 BENCHMARK(TDigestKernelDoubleDeciles)->Apply(QuantileKernelArgs);
 BENCHMARK(TDigestKernelDoubleCentiles)->Apply(QuantileKernelArgs);
 
+//
+// Segmented Aggregate
+//
+
+static void BenchmarkSegmentedAggregate(
+    benchmark::State& state, int64_t num_rows, std::vector<Aggregate> aggregates,
+    const std::vector<std::shared_ptr<Array>>& arguments,
+    const std::vector<std::shared_ptr<Array>>& keys, int64_t num_segment_keys,
+    int64_t num_segments) {
+  ASSERT_GT(num_segments, 0);
+
+  auto rng = random::RandomArrayGenerator(42);
+  auto segment_key = rng.Int64(num_rows, /*min=*/0, /*max=*/num_segments - 1);
+  int64_t* values = segment_key->data()->GetMutableValues<int64_t>(1);
+  std::sort(values, values + num_rows);
+  // num_segment_keys copies of the segment key.
+  ArrayVector segment_keys(num_segment_keys, segment_key);
+
+  BenchmarkAggregate(state, std::move(aggregates), arguments, keys, segment_keys);
+}
+
+template <typename... Args>
+static void CountScalarSegmentedByInts(benchmark::State& state, Args&&...) {

Review Comment:
   Please let me explain a bit.
   
   Though both are called "aggregation", most compute engines have two variants of it, depending on whether there are "group by" keys (compare SQL `select count(*) from t group by c` with `select count(*) from t`). Acero calls them "[scalar aggregation](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/scalar_aggregate_node.cc)" and "[group by aggregation](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/groupby_aggregate_node.cc)", and they use the aggregation functions "count/sum/..." and "hash_count/hash_sum/..." respectively. The split is understandable: without a group by key, the aggregation only needs to hold one "scalar" value (e.g., the current count/sum/...) for the whole computation, whereas a group by key immediately requires some structure like a hash table.
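   The distinction can be sketched in plain C++ (no Arrow APIs; all names here are illustrative, not Acero's): a scalar count holds a single accumulator, while a grouped count needs a hash table mapping each group key to its own accumulator.

   ```cpp
   #include <cstdint>
   #include <unordered_map>
   #include <vector>

   // Scalar aggregation, like SELECT count(*) FROM t:
   // one accumulator suffices for the whole input.
   int64_t ScalarCount(const std::vector<int64_t>& values) {
     int64_t count = 0;
     for (size_t i = 0; i < values.size(); ++i) ++count;
     return count;
   }

   // Group-by aggregation, like SELECT count(*) FROM t GROUP BY c:
   // a hash table keyed by the group value is needed to hold one
   // accumulator per group.
   std::unordered_map<int64_t, int64_t> HashCount(
       const std::vector<int64_t>& keys) {
     std::unordered_map<int64_t, int64_t> counts;
     for (int64_t k : keys) ++counts[k];
     return counts;
   }
   ```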
   
   Segment keys, on the other hand, are orthogonal to group by keys: they apply to both scalar and group by aggregations.
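   What makes segment keys cheap is that the input is sorted on them (as the benchmark's `std::sort` on the generated segment key reflects). A minimal sketch, with illustrative names only: the engine can emit one result at each segment boundary and reset a single accumulator, with no hash table over segments.

   ```cpp
   #include <cstdint>
   #include <utility>
   #include <vector>

   // Segmented scalar count: because the input is sorted on the segment
   // key, we emit one (segment, count) pair at each boundary and reset
   // the lone accumulator -- no per-segment hash table is needed.
   std::vector<std::pair<int64_t, int64_t>> SegmentedCount(
       const std::vector<int64_t>& segment_keys) {
     std::vector<std::pair<int64_t, int64_t>> results;
     if (segment_keys.empty()) return results;
     int64_t current = segment_keys[0];
     int64_t count = 0;
     for (int64_t k : segment_keys) {
       if (k != current) {
         results.emplace_back(current, count);  // flush finished segment
         current = k;
         count = 0;
       }
       ++count;
     }
     results.emplace_back(current, count);  // flush the last segment
     return results;
   }
   ```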
   
   Back to your question: yes, "CountScalar" indicates exactly that this is a "scalar aggregation" (i.e., without any group by keys) using the "count" function, and "SegmentedByInts" indicates that there may be segment keys.
   
   Hope this can clear things a bit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
