connec opened a new issue, #17169:
URL: https://github.com/apache/datafusion/issues/17169

   ### Describe the bug
   
   We have a [parquet file](https://digitalsociety-public.fsn1.your-objectstorage.com/f94d0c87-8798-4bf6-9c98-8d89971e2539.parquet) (built from [public data](https://statistics.gov.scot/data/domestic-energy-performance-certificates)) with 106 columns and 1M rows, which is 131.14 MiB in size (compressed, 913.89 MiB uncompressed).
   
   When running the following `DISTINCT ON` query using the unbounded memory pool, memory use climbs to over 160 GiB:
   
   ```sql
   SELECT DISTINCT
     ON ("ADDRESS1", "ADDRESS2", "ADDRESS3", "POSTCODE") *
   FROM
     table
   ORDER BY
     "ADDRESS1",
     "ADDRESS2",
     "ADDRESS3",
     "POSTCODE",
     "INSPECTION_DATE" DESC
   ```
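
For readers less familiar with `DISTINCT ON`: the query keeps one row per distinct address, namely the first row under the `ORDER BY`, which here is the one with the most recent `INSPECTION_DATE`. The following is a minimal Python sketch of that semantics only. It says nothing about how DataFusion executes the query or why memory grows. The function name `distinct_on_latest` and the sample rows are made up for illustration, and the sketch assumes dates are ISO-8601 strings (so string order matches date order) and that no key column is NULL.

```python
from itertools import groupby
from operator import itemgetter

# Columns named in the DISTINCT ON (...) clause of the query above.
KEY_COLUMNS = ("ADDRESS1", "ADDRESS2", "ADDRESS3", "POSTCODE")


def distinct_on_latest(rows, key_columns=KEY_COLUMNS, order_column="INSPECTION_DATE"):
    """Return one row per key: the row with the greatest order_column value.

    Hypothetical helper for illustration; rows are plain dicts.
    """
    key = itemgetter(*key_columns)
    # Python's sort is stable, so sorting by date (descending) first and by
    # key second gives the same order as ORDER BY key ASC, date DESC.
    newest_first = sorted(rows, key=itemgetter(order_column), reverse=True)
    ordered = sorted(newest_first, key=key)
    # DISTINCT ON keeps only the first row of each run of equal keys.
    return [next(group) for _, group in groupby(ordered, key=key)]


if __name__ == "__main__":
    # Made-up sample rows, not taken from the real dataset.
    sample = [
        {"ADDRESS1": "1 High St", "ADDRESS2": "", "ADDRESS3": "",
         "POSTCODE": "EH1 1AA", "INSPECTION_DATE": "2019-05-01"},
        {"ADDRESS1": "1 High St", "ADDRESS2": "", "ADDRESS3": "",
         "POSTCODE": "EH1 1AA", "INSPECTION_DATE": "2022-03-15"},
    ]
    for row in distinct_on_latest(sample):
        print(row["ADDRESS1"], row["INSPECTION_DATE"])
```

Note that the result still carries every column of each surviving row (all 106 in the real file), even though only four columns decide which rows count as duplicates.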
   
   When using a fair spill pool with 10 GiB, memory usage reaches "only" 30 GiB.
   
   These results were observed on my local machine (MacBook Pro). On a 
production machine with the same 10 GiB limit we have seen a graceful 
allocation failure:
   
   ```
   Resources exhausted: Failed to allocate additional 55.0 MB for GroupedHashAggregateStream[3]
   ```
   
   This makes me think it could be the same underlying issue as 
https://github.com/apache/datafusion/issues/13831, exacerbated by the many 
columns.
   
   ### To Reproduce
   
   See the parquet file and SQL query in the description above.
   
   ### Expected behavior
   
   In an ideal world, the memory usage for this query would respect the memory pool limit (or only use "small" allocations as described in [the docs](https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/trait.MemoryPool.html#memory-management-design)).
   
   ### Additional context
   
   I'm happy to help diagnose this further (and potentially fix it), given some advice on how to profile the memory use or narrow down the cause. For now I just wanted to capture the issue to see if it's known, as I imagine it won't be an easy fix 😄
   
   I know there are a few issues related to memory management floating around at the moment, but none that I could see directly mentioned `DISTINCT ON`, so apologies if this is a duplicate.

