Re: [PR] GH-44052: [C++][Compute] Reduce the complexity of row segmenter [arrow]

via GitHub Tue, 17 Sep 2024 20:36:15 -0700


zanmato1984 commented on code in PR #44053:
URL: https://github.com/apache/arrow/pull/44053#discussion_r1764345867



##########
cpp/src/arrow/acero/aggregate_benchmark.cc:
##########
@@ -866,5 +887,61 @@ 
BENCHMARK(TDigestKernelDoubleMedian)->Apply(QuantileKernelArgs);
 BENCHMARK(TDigestKernelDoubleDeciles)->Apply(QuantileKernelArgs);
 BENCHMARK(TDigestKernelDoubleCentiles)->Apply(QuantileKernelArgs);
 
+//
+// Segmented Aggregate
+//
+
+static void BenchmarkSegmentedAggregate(
+    benchmark::State& state, int64_t num_rows, std::vector<Aggregate> 
aggregates,
+    const std::vector<std::shared_ptr<Array>>& arguments,
+    const std::vector<std::shared_ptr<Array>>& keys, int64_t num_segment_keys,
+    int64_t num_segments) {
+  ASSERT_GT(num_segments, 0);
+
+  auto rng = random::RandomArrayGenerator(42);
+  auto segment_key = rng.Int64(num_rows, /*min=*/0, /*max=*/num_segments - 1);
+  int64_t* values = segment_key->data()->GetMutableValues<int64_t>(1);
+  std::sort(values, values + num_rows);
+  // num_segment_keys copies of the segment key.
+  ArrayVector segment_keys(num_segment_keys, segment_key);
+
+  BenchmarkAggregate(state, std::move(aggregates), arguments, keys, 
segment_keys);
+}
+
+template <typename... Args>
+static void CountScalarSegmentedByInts(benchmark::State& state, Args&&...) {

Review Comment:
   You are right, what this benchmark does is not simply a `select count(*) 
from t`. This is merely to explain the difference between a scalar agg and a 
group by agg. But that doesn't mean this benchmark is a group by - at least not 
in Acero.
   
   Though I'm not entirely sure how the segmented non-group-by agg map to a SQL 
basis, in Acero, specifying segment keys doesn't actually require group-by's 
existence. One just specifies segment keys and group by keys (independently) 
within 
https://github.com/apache/arrow/blob/9576a41001fe883ab4b6538663647d7c602fc4df/cpp/src/arrow/acero/options.h#L332-L348
 and call an "aggregate" node. The plan parsing will generate scalar agg or 
group by agg depending on if there are group by keys and assign segment keys to 
whichever of them.
   
   So I would let the naming reflect the underlying implementation (scalar agg 
vs. group by agg).
   
   (I understand the argument could be: a segment key still implies group by 
semantic. I agree. But that's a more end-to-end perspective and doesn't reflect 
how we handle the segment key).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] GH-44052: [C++][Compute] Reduce the complexity of row segmenter [arrow]

Reply via email to