Re: [PR] fix: repartition for grouping set [datafusion]

via GitHub Wed, 03 Sep 2025 06:29:26 -0700


chenkovsky commented on code in PR #16983:
URL: https://github.com/apache/datafusion/pull/16983#discussion_r2318987444



##########
datafusion/sqllogictest/test_files/aggregate.slt:
##########
@@ -7390,6 +7392,41 @@ query error Error during planning: ORDER BY and WITHIN GROUP clauses cannot be u
 SELECT array_agg(a_varchar order by a_varchar) WITHIN GROUP (ORDER BY a_varchar)
 FROM (VALUES ('a'), ('d'), ('c'), ('a')) t(a_varchar);
 
+statement ok
+SET datafusion.execution.target_partitions = 1;
+
+query TT
+EXPLAIN select * from (select 'id' as id union all select 'id' as id order by id) group by grouping sets ((id), ());
+----
+logical_plan
+01)Projection: id
+02)--Aggregate: groupBy=[[GROUPING SETS ((id), ())]], aggr=[[]]
+03)----Union
+04)------Projection: Utf8("id") AS id
+05)--------EmptyRelation: rows=1
+06)------Projection: Utf8("id") AS id
+07)--------EmptyRelation: rows=1
+physical_plan
+01)ProjectionExec: expr=[id@0 as id]
+02)--AggregateExec: mode=FinalPartitioned, gby=[id@0 as id, __grouping_id@1 as __grouping_id], aggr=[], ordering_mode=PartiallySorted([0])
+03)----CoalesceBatchesExec: target_batch_size=8192
+04)------RepartitionExec: partitioning=Hash([id@0, __grouping_id@1], 1), input_partitions=2

Review Comment:
   The input here has a single partition but multiple record batches. The
aggregation assumes that records in the same group are adjacent, which is
not true in this case. The repartition solves this problem.
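
   To illustrate the adjacency point, here is a toy Python sketch. It is not
DataFusion code: the function names, the batch contents, and the use of
None for the () grouping set are assumptions made only for illustration.
It compares a grouper that trusts adjacency with one that groups by hash
over the same two batches.

```python
from itertools import groupby


def adjacent_group_keys(batches):
    """Emit one key per run of equal adjacent keys, streaming across batches.

    This is only correct when all rows of a group arrive next to each other.
    """
    rows = (key for batch in batches for key in batch)
    return [key for key, _ in groupby(rows)]


def hashed_group_keys(batches):
    """Emit each distinct key once, regardless of arrival order."""
    seen = {}
    for batch in batches:
        for key in batch:
            seen.setdefault(key, None)  # dict preserves first-seen order
    return list(seen)


# Hypothetical input: one partition, two record batches. Each batch carries
# one row for the (id) grouping set and one for the () grouping set (None).
batches = [["id", None], ["id", None]]

adjacent = adjacent_group_keys(batches)  # groups are split across batches
hashed = hashed_group_keys(batches)      # each group appears exactly once
```

   In this sketch the adjacency-based grouper reports each group twice because
the matching rows sit in different batches, while the hash-based grouper
reports each group once.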



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

