Re: [D] Best practices for memory-efficient deduplication of pre-sorted Parquet files [datafusion]

via GitHub Fri, 18 Jul 2025 10:40:10 -0700


GitHub user alamb added a comment to the discussion: Best practices for 
memory-efficient deduplication of pre-sorted Parquet files


> The query plan for the original query:


So the input for this query is declared as ordered by:
```sql
WITH ORDER (col_1 ASC, col_2 ASC) 
```

But the grouping is on all six columns:

```sql
    GROUP BY 
        col_1, col_2, col_3, col_4, col_5, col_6 
```

So I would expect the partial (partially ordered) group by stream to be used, 
and the code to be able to stream results out whenever it sees new values of 
`col_1` and `col_2`. I am not sure why that is not happening here.
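
To illustrate the expectation, here is a minimal Python sketch. It is not 
DataFusion code: `dedup_partially_sorted`, `sort_prefix_len`, and the 
rows-as-tuples layout are made up for this example. It assumes the input is 
already sorted on its first two columns, so only the distinct rows of the 
current `(col_1, col_2)` run need to be held in memory.

```python
from itertools import groupby


def dedup_partially_sorted(rows, sort_prefix_len=2):
    """Yield distinct rows from input sorted on its first `sort_prefix_len` columns.

    Illustrative only: rows are hashable tuples, and the grouping key is the
    whole row (the equivalent of GROUP BY on every column).
    """
    # Consecutive rows sharing the sorted prefix form one run.
    for _prefix, run in groupby(rows, key=lambda row: row[:sort_prefix_len]):
        # Duplicates can only occur inside a run, so per-run state is enough.
        # dict.fromkeys keeps the first occurrence of each row, in input order.
        distinct = dict.fromkeys(run)
        # The prefix is about to change: emit this run and drop its state.
        yield from distinct


# Hypothetical input, sorted on the first two columns only.
rows = [
    (1, 1, "a"),
    (1, 1, "a"),
    (1, 1, "b"),
    (1, 2, "a"),
    (2, 1, "a"),
    (2, 1, "a"),
]
deduplicated = list(dedup_partially_sorted(rows))
```

Peak memory in this sketch is bounded by the number of distinct rows within a 
single `(col_1, col_2)` run rather than by the whole input.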

If you could make a reproducer with synthetic data and file a ticket, I would 
be happy to look into this further.


GitHub link: 
https://github.com/apache/datafusion/discussions/16776#discussioncomment-13809563
