westonpace commented on code in PR #14352:
URL: https://github.com/apache/arrow/pull/14352#discussion_r1019661909
##########
cpp/src/arrow/compute/exec/options.h:
##########
@@ -106,21 +106,32 @@ class ARROW_EXPORT ProjectNodeOptions : public
ExecNodeOptions {
std::vector<std::string> names;
};
-/// \brief Make a node which aggregates input batches, optionally grouped by
keys.
+/// \brief Make a node which aggregates input batches, optionally grouped by
keys and
+/// optionally segmented by segment-keys. Both keys and segment-keys determine
the group.
+/// However segment-keys are also used for determining grouping segments,
which should be
+/// large, and allow streaming a partial aggregation result after processing
each segment.
Review Comment:
> and the resulting stream will generate more batches and with lower latency.
We can maybe collect outgoing data to ensure we aren't sending anything out
of this node that is too tiny. That can be left for a future PR.
I see your point on state initialization though that is generally a pretty
cheap operation (I think). Mainly I don't want to have to burden users with
trying to optimize segment sizes for performance. That feels like a problem we
can solve internally and they should just setup their segments to match their
data.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]