GitHub user alamb added a comment to the discussion: Best practices for memory-efficient deduplication of pre-sorted Parquet files
> The query plan for the original query:

So this query is ordered like

```sql
WITH ORDER (col_1 ASC, col_2 ASC)
```

But the grouping is on all columns

```sql
GROUP BY col_1, col_2, col_3, col_4, col_5, col_6
```

So I would expect that the partial group by stream could be used, and that the code would be able to stream results out whenever it sees new values of `col_1` and `col_2`.

I am not sure what is going on. If you could make a reproducer with synthetic data and file a ticket, I would be happy to look into this further.

GitHub link: https://github.com/apache/datafusion/discussions/16776#discussioncomment-13809563
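
To illustrate the streaming behavior described in the comment, here is a minimal Python sketch. It is not DataFusion's implementation, only a model of the idea: when the input is sorted on a prefix of the grouping columns, the state held for one `(col_1, col_2)` value can be emitted and discarded as soon as a new prefix value appears. The function name `dedup_partially_sorted` and the parameter `sorted_prefix_len` are hypothetical names chosen for this example.

```python
from typing import Hashable, Iterable, Iterator, Tuple

Row = Tuple[Hashable, ...]  # e.g. (col_1, col_2, col_3, col_4, col_5, col_6)


def dedup_partially_sorted(
    rows: Iterable[Row], sorted_prefix_len: int = 2
) -> Iterator[Row]:
    """Yield distinct rows from input sorted on its first `sorted_prefix_len` columns.

    Only the distinct rows of the current prefix run are kept in memory.
    """
    no_prefix = object()       # sentinel: no run has been started yet
    current_prefix = no_prefix
    seen = {}                  # dict used as an insertion-ordered set

    for row in rows:
        prefix = row[:sorted_prefix_len]
        if prefix != current_prefix:
            # Because the input is sorted on the prefix, the previous
            # run is complete: emit its groups and drop its state.
            yield from seen
            seen = {}
            current_prefix = prefix
        seen[row] = None       # duplicate rows collapse onto the same key

    # Emit the final run once the input is exhausted.
    yield from seen


rows = [
    (1, 1, "a", 0, 0, 0),
    (1, 1, "a", 0, 0, 0),  # duplicate within the (1, 1) run
    (1, 1, "b", 0, 0, 0),
    (1, 2, "a", 0, 0, 0),
    (1, 2, "a", 0, 0, 0),  # duplicate within the (1, 2) run
    (2, 1, "a", 0, 0, 0),
]
result = list(dedup_partially_sorted(rows))
```

In this model, peak memory is bounded by the number of distinct rows sharing a single `(col_1, col_2)` value rather than by the number of distinct rows in the whole input.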