connec opened a new issue, #17169: URL: https://github.com/apache/datafusion/issues/17169
### Describe the bug

We have a [parquet file](https://digitalsociety-public.fsn1.your-objectstorage.com/f94d0c87-8798-4bf6-9c98-8d89971e2539.parquet) (built from [public data](https://statistics.gov.scot/data/domestic-energy-performance-certificates)) with 106 columns and 1M rows which is 131.14 MiB in size (compressed, 913.89 MiB uncompressed). When running a `DISTINCT ON` query using the unbounded memory pool, memory use climbs to over 160 GiB for this query:

```sql
SELECT DISTINCT ON ("ADDRESS1", "ADDRESS2", "ADDRESS3", "POSTCODE") *
FROM table
ORDER BY "ADDRESS1", "ADDRESS2", "ADDRESS3", "POSTCODE", "INSPECTION_DATE" DESC
```

When using a fair spill pool with 10 GiB, memory usage reaches "only" 30 GiB.

These results were observed on my local machine (MacBook Pro). On a production machine with the same 10 GiB limit we have seen a graceful allocation failure:

```
Resources exhausted: Failed to allocate additional 55.0 MB for GroupedHashAggregateStream[3]
```

This makes me think it could be the same underlying issue as https://github.com/apache/datafusion/issues/13831, exacerbated by the many columns.

### To Reproduce

See the parquet file and SQL query in the description above.

### Expected behavior

In an ideal world, the memory usage for this query would respect the memory pool limit (or only use "small" allocations as described in [the docs](https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/trait.MemoryPool.html#memory-management-design)).

### Additional context

I'm happy to help diagnose this further (and potentially fix) with some advice on how to profile the memory use or narrow down the cause. For now I just wanted to capture the issue to see if it's known as I imagine it won't be an easy fix 😄

I know there are a few issues related to memory management floating around atm but none that I could see directly mentioned `DISTINCT ON`, so apologies if this is a duplicate.

--
This is an automated message from the Apache Git Service.