rtpsw commented on code in PR #14352:
URL: https://github.com/apache/arrow/pull/14352#discussion_r1019914258


##########
cpp/src/arrow/compute/exec/options.h:
##########
@@ -106,21 +106,32 @@ class ARROW_EXPORT ProjectNodeOptions : public 
ExecNodeOptions {
   std::vector<std::string> names;
 };
 
-/// \brief Make a node which aggregates input batches, optionally grouped by 
keys.
+/// \brief Make a node which aggregates input batches, optionally grouped by 
keys and
+/// optionally segmented by segment-keys. Both keys and segment-keys determine 
the group.
+/// However segment-keys are also used for determining grouping segments, 
which should be
+/// large, and allow streaming a partial aggregation result after processing 
each segment.

Review Comment:
   > We can maybe collect outgoing data to ensure we aren't sending anything 
out of this node that is too tiny. That can be left for a future PR.
   
   Agreed.
   
   > I see your point on state initialization though that is generally a pretty 
cheap operation (I think). Mainly I don't want to have to burden users with 
trying to optimize segment sizes for performance. That feels like a problem we 
can solve internally and they should just setup their segments to match their 
data.
   
   I agree we could deal with size optimizations internally - I'd prefer in a 
separate jira. It's actually the output batch size being optimized because the 
segments, and their sizes, are determined by the input data. Note that the cost 
of a small segment sizes may not be trivial as it also includes the overhead of 
finishing (see [this 
post](https://github.com/apache/arrow/pull/14352#discussion_r1019650164)), at 
least with the current design. It also includes the usual processing cost due 
to outputting more batches.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to