mustafasrepo commented on code in PR #6034:
URL: https://github.com/apache/arrow-datafusion/pull/6034#discussion_r1176301909


##########
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##########
@@ -663,12 +914,32 @@ impl std::fmt::Debug for AggregationState {
 }
 
 impl GroupedHashAggregateStream {
+    /// Prune the groups from the `self.aggr_state.group_states` which are in
+    /// `GroupStatus::Emitted`(this status means that result of this group emitted/outputted already, and
+    /// we are sure that these groups cannot receive new rows.) status.
+    fn prune(&mut self) {

Review Comment:
   `ordered_columns` stores the section of the group by expression that defines
the ordering in `GroupState`. When a different `ordered_columns` value is
received, we are sure that previous groups with a different `ordered_columns`
are finalized (they will no longer receive new values). At the end of
`group_aggregate_batch`, we iterate over `self.aggr_state.group_states` and mark
as prunable every group whose `ordered_columns` differs from the
`ordered_columns` of the most recent (last) group.
   
   As an example, if the table is like the one below, and we know that it
satisfies `ORDER BY a ASC`
   |a|
   |----|
   |1|
   |1|
   |2|
   |2|
   |3|
   |3|
   
   and the group by clause is `GROUP BY a`, the groups with
`ordered_columns = Some(vec![1])` and `ordered_columns = Some(vec![2])` will be
pruned, since they differ from `ordered_columns = Some(vec![3])`. However, the
last group is not pruned, because we can still receive rows with the value 3
for column `a`.
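
   To make the marking step concrete, here is a minimal, self-contained sketch of the idea. It is not the code in this PR: the `GroupState` struct below is a simplified stand-in, the `prunable` flag and the helpers `build_groups`, `mark_prunable`, and `prune_groups` are hypothetical names used only for illustration, and the real implementation tracks a `GroupStatus` rather than a boolean.

```rust
/// Simplified stand-in for the per-group state (assumption: the real
/// `GroupState` carries accumulators, row indices, and a status enum).
#[derive(Debug, Clone, PartialEq)]
struct GroupState {
    /// Values of the ordered part of the group by expression for this group.
    ordered_columns: Option<Vec<i64>>,
    /// Hypothetical flag: true once the group can no longer receive rows.
    prunable: bool,
}

/// Builds one group per distinct value of column `a` (illustration only).
fn build_groups(values: &[i64]) -> Vec<GroupState> {
    let mut groups: Vec<GroupState> = Vec::new();
    for &v in values {
        let key = Some(vec![v]);
        if !groups.iter().any(|g| g.ordered_columns == key) {
            groups.push(GroupState { ordered_columns: key, prunable: false });
        }
    }
    groups
}

/// Marks every group whose `ordered_columns` differs from that of the
/// most recent (last) group as prunable.
fn mark_prunable(group_states: &mut [GroupState]) {
    let last = match group_states.last() {
        Some(g) => g.ordered_columns.clone(),
        None => return, // no groups yet, nothing to mark
    };
    for g in group_states.iter_mut() {
        if g.ordered_columns != last {
            g.prunable = true;
        }
    }
}

/// Drops the groups that were marked as prunable.
fn prune_groups(group_states: &mut Vec<GroupState>) {
    group_states.retain(|g| !g.prunable);
}

fn main() {
    // Column `a` from the example table, already sorted ascending.
    let mut groups = build_groups(&[1, 1, 2, 2, 3, 3]);
    mark_prunable(&mut groups);
    // Groups for a = 1 and a = 2 are finalized; a = 3 may still grow.
    assert!(groups[0].prunable && groups[1].prunable);
    assert!(!groups[2].prunable);
    prune_groups(&mut groups);
    assert_eq!(groups.len(), 1);
}
```

   In this sketch the comparison is always against the last group, so the group for the current ordering key survives until a batch with a larger key arrives.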


