GitHub user alamb added a comment to the discussion: Best practices for
memory-efficient deduplication of pre-sorted Parquet files
👋
Given your description, I am surprised that this query is using a
HashAggregateStream -- the hash aggregate needs to buffer the entire dataset in
RAM (or spill it), which is likely why it is running out of memory.
Given that the data is sorted by col_1 and col_2, I would expect this query to
use the streaming aggregate operator (which should not need much memory at all).
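One possible cause (an assumption, since the table definition is not shown in this comment): the planner can only choose the streaming operator if it knows the input is sorted. A minimal sketch of declaring that order when the table is registered through SQL, with a placeholder path:

```sql
-- Sketch only: the LOCATION path is hypothetical.
-- WITH ORDER tells the optimizer about an existing sort order; it does not
-- sort anything, so the files must really be sorted by (col_1, col_2).
-- Some DataFusion versions may require an explicit column list here.
CREATE EXTERNAL TABLE example
STORED AS PARQUET
WITH ORDER (col_1 ASC, col_2 ASC)
LOCATION '/path/to/example/';
```

If the declared order does not match the data, query results can be incorrect, so this only applies when the files are known to be sorted.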
What does the plan look like for this:
```sql
EXPLAIN SELECT
col_1,
col_2,
first_value(col_3) AS col_3,
first_value(col_4) AS col_4
FROM
example
GROUP BY
col_1, col_2
ORDER BY
col_1, col_2
```
Do you get a different operator when you remove the first/last value
aggregates?
```sql
EXPLAIN SELECT
col_1,
col_2 -- NOTE remove the first_value / last_value aggregates
FROM
example
GROUP BY
col_1, col_2
ORDER BY
col_1, col_2
```
GitHub link:
https://github.com/apache/datafusion/discussions/16776#discussioncomment-13777332