bharath-techie commented on issue #19386:
URL: https://github.com/apache/datafusion/issues/19386#issuecomment-3675139619

I also saw another issue while working on this one. Say you limit memory to 4 GB and `GroupedHashAggregateStream` spills: for the `URL` field, each spill writes out the entire record batch's underlying buffers, so the amount spilled to disk exceeds 100 GB and ends in a resources-exhausted error. This is presumably because `StringView` arrays hold views into large shared buffers, and serializing a batch writes those buffers in full.
   
Likewise in the spill manager: if I `gc` the string view arrays and then spill, only the current record batch's data is written.
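
For illustration, here is a minimal sketch of that compaction step, assuming a hypothetical helper `gc_string_views` applied to each batch before it is handed to the spill writer; `StringViewArray::gc` is the existing arrow-rs call that rebuilds a view array's buffers to hold only the bytes its views actually reference:

```
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, StringViewArray};
use arrow::datatypes::DataType;
use arrow::record_batch::RecordBatch;

/// Hypothetical helper: compact the StringView columns of a batch so that
/// spilling it only writes the bytes its views reference, not the full
/// buffers shared with every other batch.
fn gc_string_views(batch: &RecordBatch) -> RecordBatch {
    let columns: Vec<ArrayRef> = batch
        .columns()
        .iter()
        .map(|col| match col.data_type() {
            // `gc` rebuilds the view buffers with only the referenced bytes.
            DataType::Utf8View => {
                let sv = col
                    .as_any()
                    .downcast_ref::<StringViewArray>()
                    .expect("Utf8View column downcasts to StringViewArray");
                Arc::new(sv.gc()) as ArrayRef
            }
            // Other column types are passed through untouched.
            _ => Arc::clone(col),
        })
        .collect();
    RecordBatch::try_new(batch.schema(), columns)
        .expect("schema is unchanged, so rebuilding the batch cannot fail")
}
```

A real fix would presumably also handle `DataType::BinaryView` the same way, and weigh the copy cost of `gc` against the on-disk savings.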
   
Easily reproducible:

Take one ClickBench partitioned Parquet file (~120 MB):
```
RUST_LOG=datafusion_physical_plan=debug ./datafusion-cli -m 40m --disk-spill-path /home/ec2-user/spilldir --disk-limit 75g
```

```
SET datafusion.execution.listing_table_ignore_subdirectory = false;
SET datafusion.execution.target_partitions = 1;
SET datafusion.execution.parquet.binary_as_string = true;
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION '/home/ec2-user/hits_0.parquet';
SELECT "URL", COUNT(*) AS c FROM hits GROUP BY "URL" ORDER BY c DESC LIMIT 10;
```

Before fix:
```
[2025-12-19T07:44:47Z DEBUG datafusion_physical_plan::spill::in_progress_spill_file] [SPILL_FILE] Finished spill file: path="/home/ec2-user/spill/datafusion-xKj4Qt/.tmpoxkWIz", size=820.54 MB, total_spilled_bytes=820.54 MB, total_spill_files=1
```

After fix:
```
[2025-12-19T07:46:54Z DEBUG datafusion_physical_plan::spill::in_progress_spill_file] [SPILL_FILE] Finished spill file: path="/home/ec2-user/spill/datafusion-3z9mL6/.tmpF7hNi9", size=33.43 MB, total_spilled_bytes=33.43 MB, total_spill_files=1
```

