alamb commented on issue #13831:
URL: https://github.com/apache/datafusion/issues/13831#issuecomment-2553346538

   This is a good find.
   
   Given your description, it sounds like the storage for the group values is 
what is taking the memory:
   
   ```sql
   group by truncated_time, k8s_deployment_name, message
   ```
   
   The aggregate operator does account for the memory stored in the groups here:
   
   
https://github.com/apache/datafusion/blob/63ce4865896b906ca34fcbf85fdc55bff3080c30/datafusion/physical-plan/src/aggregates/row_hash.rs#L900-L899
   
   However, I believe the memory accounting is only updated after processing 
an entire batch of values.
   
   So for example, if you are using a batch of 8000 rows and each row has a 
value of 8k, roughly 64 MB (8000 x 8 KB) can be allocated for that column 
before the accounting is updated (and since your query groups by 3 columns, 
the total may be even higher).
   
   I can think of two possible solutions:
   1. Use a smaller batch size for such queries so the memory accounting is 
more fine-grained
   2. Leave more "slop" in the configured limits (e.g. set the maximum memory 
limit to be 500MB less than you actually have)
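   
   As a rough sketch of option 1, the batch size can be lowered for a session 
with the `datafusion.execution.batch_size` setting. The value 1024 below is 
only an illustrative assumption, not a recommendation; a suitable value 
depends on how large the grouped values are.
   
   ```sql
   -- Assumed example value: 1024 rows per batch instead of the default 8192,
   -- so that less data is buffered between memory accounting updates
   SET datafusion.execution.batch_size = 1024;
   ```
   
   A smaller batch size typically trades some throughput for tighter memory 
accounting.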


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org