Re: [I] Different results in TPC-DS q14 depending on the number of partitions [datafusion]

via GitHub Fri, 16 Jan 2026 05:30:41 -0800


gene-bordegaray commented on issue #19849:
URL: https://github.com/apache/datafusion/issues/19849#issuecomment-3760036310


   I believe the issue is due to grouping sets. They are implemented as a single 
aggregate that emits (full key, partial key, total) rows by filling some 
columns with NULL. This causes the partial aggregate to report that its output 
is hash partitioned by the group-by key, which leads to the incorrect results.
   
   Take this example:
   
   Input rows (hash partitioned by (a,b))
   
   - P1: (a=1,b=10, val=5)   (a=2,b=20, val=7)
   
   - P2: (a=1,b=11, val=3)   (a=2,b=21, val=4)
   
   ROLLUP(a,b) produces groups:
     - (a,b) -> (full key)
     - (a,NULL) -> (rollup)
     - (NULL,NULL) -> (grand total)
   
   Partial aggregate output per partition:
   P1:
   - (1,10) -> sum=5
   - (2,20) -> sum=7
   - (1,NULL) -> sum=5
   - (2,NULL) -> sum=7
   - (NULL,NULL) -> sum=12
   
   P2:
   - (1,11) -> sum=3
   - (2,21) -> sum=4
   - (1,NULL) -> sum=3
   - (2,NULL) -> sum=4
   - (NULL,NULL) -> sum=7
   
   The FinalPartitioned aggregate wrongly assumes each group key is fully 
contained in a single partition, since (a, NULL) satisfies (a, b) as it is a 
superset, so it does not merge across partitions.
   Result:
   - (1,NULL) appears twice (P1 -> 5 and P2 -> 3)
   - (NULL,NULL) appears twice (P1 -> 12 and P2 -> 7)
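
   The example above can be reproduced with a small standalone simulation. This is only a sketch in plain Python, not DataFusion code: the names `partial_rollup`, `per_partition_final` and `merged` are made up for illustration, and the "buggy" path simply finalizes each partition on its own to mimic skipping the cross-partition merge.

```python
from collections import defaultdict

# Input rows as (a, b, val), split into the two partitions from the example.
partitions = {
    "P1": [(1, 10, 5), (2, 20, 7)],
    "P2": [(1, 11, 3), (2, 21, 4)],
}

def partial_rollup(rows):
    """Partial SUM(val) for ROLLUP(a, b); None stands in for NULL.

    Hypothetical helper: every input row feeds three grouping sets.
    """
    sums = defaultdict(int)
    for a, b, val in rows:
        for key in ((a, b), (a, None), (None, None)):
            sums[key] += val
    return dict(sums)

partials = {name: partial_rollup(rows) for name, rows in partitions.items()}

# Mimics the bug: trust the claimed partitioning and finalize each
# partition independently, so rollup keys are emitted once per partition.
per_partition_final = [
    (key, total) for part in partials.values() for key, total in part.items()
]

# Expected behaviour: merge partial states by group key across partitions.
merged = defaultdict(int)
for part in partials.values():
    for key, total in part.items():
        merged[key] += total
merged = dict(merged)

duplicated = sorted(
    {k for k, _ in per_partition_final
     if sum(1 for k2, _ in per_partition_final if k2 == k) > 1},
    key=str,
)
print("duplicated keys without merge:", duplicated)
print("merged result:", merged)
```

   In this sketch the full keys such as (1,10) stay unique, but every key containing a NULL placeholder shows up once per partition unless the partial states are merged by key first.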
   

