This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch branch-53
in repository https://gitbox.apache.org/repos/asf/datafusion.git


The following commit(s) were added to refs/heads/branch-53 by this push:
     new 01437a2636 [branch-53] Fix duplicate group keys after hash aggregation spill (#20724) (#20858) (#20918)
01437a2636 is described below

commit 01437a2636473bb2532f94081eb53efa10802f48
Author: Andrew Lamb <[email protected]>
AuthorDate: Fri Mar 13 14:25:36 2026 -0400

    [branch-53] Fix duplicate group keys after hash aggregation spill (#20724) (#20858) (#20918)
    
    - Part of https://github.com/apache/datafusion/issues/20724
    - Closes https://github.com/apache/datafusion/issues/20724 on branch-53
    
    This PR:
    - Backports https://github.com/apache/datafusion/pull/20858 from
    @gboucher90 to the branch-53 line
    
    Co-authored-by: gboucher90 <[email protected]>
---
 datafusion/physical-plan/src/aggregates/row_hash.rs | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/datafusion/physical-plan/src/aggregates/row_hash.rs b/datafusion/physical-plan/src/aggregates/row_hash.rs
index 4a1b0e5c8c..8a45e4b503 100644
--- a/datafusion/physical-plan/src/aggregates/row_hash.rs
+++ b/datafusion/physical-plan/src/aggregates/row_hash.rs
@@ -1267,6 +1267,18 @@ impl GroupedHashAggregateStream {
             // on the grouping columns.
             self.group_ordering = GroupOrdering::Full(GroupOrderingFull::new());
 
+            // Recreate group_values to use streaming mode (GroupValuesColumn<true>
+            // with scalarized_intern) which preserves input row order, as required
+            // by GroupOrderingFull. This is only needed for multi-column group by,
+            // since single-column uses GroupValuesPrimitive which is always safe.
+            let group_schema = self
+                .spill_state
+                .merging_group_by
+                .group_schema(&self.spill_state.spill_schema)?;
+            if group_schema.fields().len() > 1 {
+                self.group_values = new_group_values(group_schema, &self.group_ordering)?;
+            }
+
             // Use `OutOfMemoryMode::ReportError` from this point on
             // to ensure we don't spill the spilled data to disk again.
             self.oom_mode = OutOfMemoryMode::ReportError;

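For context on the bug this patch addresses, here is a minimal, self-contained sketch (not DataFusion code; all names below are hypothetical). After a spill, spilled runs are combined with a streaming merge that emits a new group whenever the key changes, which is only correct if every run presents its group keys in sorted order. If the in-memory grouping structure yields keys in arbitrary hash order instead, the same key can be emitted more than once:

```rust
// Hypothetical k-way streaming merge over runs of (group_key, agg_value).
// It assumes each run is sorted by key and combines adjacent equal keys.
fn merge_and_group(runs: &[Vec<(i64, i64)>]) -> Vec<(i64, i64)> {
    let mut idx = vec![0usize; runs.len()]; // cursor into each run
    let mut merged: Vec<(i64, i64)> = Vec::new();
    loop {
        // Pick the run whose current head has the smallest key.
        let mut best: Option<usize> = None;
        for (r, run) in runs.iter().enumerate() {
            if idx[r] < run.len()
                && best.map_or(true, |b| run[idx[r]].0 < runs[b][idx[b]].0)
            {
                best = Some(r);
            }
        }
        let Some(r) = best else { break };
        let (key, val) = runs[r][idx[r]];
        idx[r] += 1;
        // Combine with the previous output row only if the key matches,
        // i.e. duplicates are only detected when they are adjacent.
        match merged.last_mut() {
            Some(last) if last.0 == key => last.1 += val,
            _ => merged.push((key, val)),
        }
    }
    merged
}

fn main() {
    // Both runs sorted by key: each group key appears exactly once.
    let ok = merge_and_group(&[vec![(1, 2), (2, 1)], vec![(1, 3), (3, 1)]]);
    assert_eq!(ok, vec![(1, 5), (2, 1), (3, 1)]);

    // Second run in hash (unsorted) order: the adjacency assumption breaks
    // and group key 1 is emitted twice -- the duplicate-group-key symptom.
    let broken = merge_and_group(&[vec![(1, 2), (2, 1)], vec![(3, 1), (1, 3)]]);
    assert_eq!(broken, vec![(1, 2), (2, 1), (3, 1), (1, 3)]);
    println!("duplicate group keys: {:?}", broken);
}
```

This is why the patch recreates `group_values` in an order-preserving mode for multi-column group by once `GroupOrderingFull` is in effect.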

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
