[GitHub] [arrow] westonpace commented on a diff in pull request #14352: ARROW-17642: [C++] Add ordered aggregation

GitBox Thu, 10 Nov 2022 14:17:39 -0800


westonpace commented on code in PR #14352:
URL: https://github.com/apache/arrow/pull/14352#discussion_r1019661909



##########
cpp/src/arrow/compute/exec/options.h:
##########
@@ -106,21 +106,32 @@ class ARROW_EXPORT ProjectNodeOptions : public ExecNodeOptions {
   std::vector<std::string> names;
 };
 
-/// \brief Make a node which aggregates input batches, optionally grouped by keys.
+/// \brief Make a node which aggregates input batches, optionally grouped by keys and
+/// optionally segmented by segment-keys. Both keys and segment-keys determine the group.
+/// However segment-keys are also used for determining grouping segments, which should be
+/// large, and allow streaming a partial aggregation result after processing each segment.

Review Comment:
   > and the resulting stream will generate more batches and with lower latency.
   
   We could collect outgoing data in this node so that it never emits batches that are too tiny.  That can be left for a future PR.
   
   I see your point on state initialization, though I think that is generally a pretty cheap operation.  Mainly, I don't want to burden users with trying to optimize segment sizes for performance.  That feels like a problem we can solve internally; users should just set up their segments to match their data.
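
To make the "collect outgoing data" idea concrete, here is a minimal sketch of what such buffering might look like. It is not Arrow code: `Batch`, `BatchCoalescer`, `Push`, `Flush`, and the `min_rows` threshold are all hypothetical names used for illustration, and a plain `std::vector<int>` stands in for a real output batch.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an output batch: just a column of values.
using Batch = std::vector<int>;

// Buffers small batches and forwards them downstream only once at least
// `min_rows` rows have accumulated. Illustrative only; not an Arrow API.
class BatchCoalescer {
 public:
  BatchCoalescer(std::size_t min_rows, std::function<void(Batch)> downstream)
      : min_rows_(min_rows), downstream_(std::move(downstream)) {}

  // Called once per finished segment with that segment's partial result.
  void Push(const Batch& batch) {
    pending_.insert(pending_.end(), batch.begin(), batch.end());
    if (pending_.size() >= min_rows_) Flush();
  }

  // Called at end of input so that trailing rows are not lost.
  void Flush() {
    if (pending_.empty()) return;
    Batch out;
    out.swap(pending_);  // leaves pending_ empty for the next round
    downstream_(std::move(out));
  }

 private:
  std::size_t min_rows_;
  std::function<void(Batch)> downstream_;
  Batch pending_;
};
```

With something along these lines, users could pick segments that match their data, and the node itself would decide when enough rows have piled up to be worth sending downstream. The trade-off is that buffering gives back some of the latency benefit of small segments.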



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

