caseykneale commented on issue #7373:
URL: 
https://github.com/apache/arrow-datafusion/issues/7373#issuecomment-1691456880

   > Could it be you're reading only a part of the output (e.g. only reading 
one of the output partitions)?
   
   The aggregations I was using were counts over a column containing nulls.
Adding an `ORDER BY` on the aggregation keys, after the `JOIN`s but before the
`GROUP BY`, resolved the issue.
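
   For reference, the shape of that workaround can be sketched with Python's
built-in `sqlite3` as a stand-in engine. This is not DataFusion, and the table
and column names are hypothetical. In an engine where `GROUP BY` does not
depend on input order, both queries should return the same counts, with
`COUNT(col)` skipping nulls:

```python
import sqlite3

# In-memory stand-in for the real tables; all names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE events (user_id INTEGER, value INTEGER);
    INSERT INTO users VALUES (1, 'a'), (2, 'b');
    INSERT INTO events VALUES (2, 10), (1, NULL), (2, NULL), (1, 5), (2, 7);
""")

# Plain form: JOIN, then GROUP BY, with no explicit ordering.
plain = conn.execute("""
    SELECT u.name, COUNT(e.value)
    FROM users u JOIN events e ON e.user_id = u.id
    GROUP BY u.name
""").fetchall()

# Workaround form: ORDER BY the aggregation key after the JOIN,
# then GROUP BY over the sorted subquery.
presorted = conn.execute("""
    SELECT name, COUNT(value)
    FROM (
        SELECT u.name AS name, e.value AS value
        FROM users u JOIN events e ON e.user_id = u.id
        ORDER BY u.name
    )
    GROUP BY name
""").fetchall()

plain_counts = dict(plain)
presorted_counts = dict(presorted)
```

   If the two results differ on the same data, that would point at the engine
rather than at the query.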
   
   The data for each table is in a single Parquet file. I am serializing the
query output directly from the vec of record batches into a vec of simple
structs.
   
   My instinct was that groupbys require ordering, as there are no indices on
tables and joins are nondeterministic by design, leading to groups that are
split across an intermediate form. Possibly across partitions? That's not a bug
in my opinion, but worth documenting. However, if groupbys are skipping
partitions when records are far enough out of order, that could be a bug? It
depends on how groupbys are implemented, and what an end user should expect of
them.
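
   To make the distinction concrete, here is a minimal Python sketch of the two
grouping strategies I have in mind. It is an illustration only and says nothing
about how DataFusion actually implements `GROUP BY`; the rows and function
names are made up. A hash-based group-by does not care about input order, while
a run-based one (like `itertools.groupby`) splits a group whenever its keys are
not adjacent:

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical (key, value) rows in join output order; None stands in for NULL.
rows = [("b", 10), ("a", None), ("b", None), ("a", 5), ("b", 7)]


def hash_group_count(rows):
    """Count non-null values per key with a hash table; input order is irrelevant."""
    counts = defaultdict(int)
    for key, value in rows:
        counts[key] += value is not None
    return dict(counts)


def run_group_count(rows):
    """Count non-null values per run of consecutive equal keys (needs sorted input)."""
    return [
        (key, sum(value is not None for _, value in run))
        for key, run in groupby(rows, key=lambda row: row[0])
    ]


hashed = hash_group_count(rows)
unsorted_runs = run_group_count(rows)  # groups come out split into several runs
sorted_runs = run_group_count(sorted(rows, key=lambda row: row[0]))
```

   With the unsorted input the run-based version emits the same key more than
once, which is the kind of "split group" I was guessing at; sorting first makes
it agree with the hash-based result.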
   
   In several dataframe technologies I've used, you are expected to sort the
data before performing a groupby. Maybe some implementations do this for the
end user without them explicitly writing it, but several I've used and built
did not. So in my opinion it's only a bug if this wasn't the team's
intention :); otherwise it's a documentation issue, or there's something I'm
still missing here.


