caseykneale commented on issue #7373: URL: https://github.com/apache/arrow-datafusion/issues/7373#issuecomment-1691456880
> Could it be you're reading only a part of the output (e.g. only reading one of the output partitions)?

The aggregation functions I was using were counting over a column with nulls. Introducing an `ORDER BY` over the aggregation keys, before the `GROUP BY` but after the `JOIN`s, resolved the issue. All data for each table is in 1 parquet file. I am serializing the output from the query directly from the vec of record batches into a vec of simple structs.

My instinct was that groupbys require ordering, as there are no indices on tables and joins are nondeterministic by design, leading to groups that are split across an intermediate form. Possibly across partitions? That's not a bug in my opinion, but worth documenting.

However, if groupbys are skipping partitions when records are far enough out of order, that could be a bug? It depends on how groupbys are implemented, and what an end user should expect of them. In several dataframe technologies I've used, it's been expected that you sort the data before performing a groupby. Maybe some implementations do this for the end user without them explicitly writing it, but several I've used and made did not.

So it's only a bug if this wasn't the team's intention, in my opinion :). Otherwise it's a documentation issue, or there's something I'm still missing here.
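
To make the distinction above concrete, here is a minimal Python sketch (not DataFusion code, and not a claim about how DataFusion implements `GROUP BY`). The rows and both helper functions are hypothetical. It contrasts run-based grouping, which only merges adjacent rows with equal keys and so needs sorted input, with hash-based grouping, which does not care about row order. Both count non-null values per key.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical (key, value) rows, in no particular order, e.g. after a join.
rows = [("a", 1), ("b", None), ("a", None), ("b", 2), ("a", 3)]


def count_non_null_adjacent(rows):
    """Run-based grouping: only adjacent rows with equal keys form a group."""
    return [
        (key, sum(value is not None for _, value in group))
        for key, group in groupby(rows, key=lambda row: row[0])
    ]


def count_non_null_hashed(rows):
    """Hash-based grouping: row order does not affect the result."""
    counts = defaultdict(int)
    for key, value in rows:
        counts[key] += value is not None  # True adds 1, False adds 0
    return dict(counts)


# Unsorted input: the run-based version emits each key more than once.
split_groups = count_non_null_adjacent(rows)

# Sorting by key first gives one group per key.
sorted_groups = count_non_null_adjacent(sorted(rows, key=lambda row: row[0]))

# The hash-based version gives one group per key either way.
hashed_groups = count_non_null_hashed(rows)
```

With unsorted input the run-based version returns several partial counts for the same key, which is the "split groups" symptom described above. If an engine aggregates by hashing, an explicit sort should not be needed for correct results, so seeing different counts with and without `ORDER BY` would point away from expected behavior.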
