avantgardnerio opened a new issue, #13831: URL: https://github.com/apache/datafusion/issues/13831
### Describe the bug

When attempting to accumulate large text fields with a `group by`, it was observed that `group_aggregate_batch()` can OOM despite ostensibly using the `MemoryPool`.

Query:

```
select truncated_time, count(*) AS cnt
from (
    select truncated_time, k8s_deployment_name, message
    from (
        SELECT
            priorityclass,
            timestamp,
            date_trunc('day', timestamp) AS truncated_time,
            k8s_deployment_name,
            message
        FROM agg_oom
        where priorityclass != 'low'
    )
    group by truncated_time, k8s_deployment_name, message
)
group by truncated_time
```

The query ran on 8 Parquet files of ~50 MB each, where the `message` column holds strings of up to 8192 bytes.

When profiled, it was by far the largest use of memory.

When logging, we can see it fails while interning:

```
converting 3 rows
interning 8192 rows with 1486954 bytes
interned 8192 rows, now I'm 13054176 bytes
resizing to 14103171
resizing to 14103171
reserving 28206342 extra bytes
converting 3 rows
interning 8192 rows with 1350859 bytes
memory allocation of 25690112 bytes failed
Aborted (core dumped)
```

### To Reproduce

1. Set up a test with

   ```
   let memory_limit = 125_000_000;
   let MEMORY_FRACTION = 1.0;
   let rt_config = RuntimeConfig::new()
       .with_memory_limit(memory_limit, MEMORY_FRACTION);
   ```

2. Set `ulimit -v 1152000`.
3. Query some Parquet files with long strings.

### Expected behavior

`group_aggregate_batch()` should not make the assumption:

```
// Here we can ignore `insufficient_capacity_err` because we will spill later,
// but at least one batch should fit in the memory
```

Instead, it should account for the fact that adding 1 row to a million does not allocate 1,000,001 rows, but rather 2,000,000 when the `Vec` resizes exponentially.

### Additional context

Proposed solution: add

```
self.reservation.try_resize(self.reservation.size() * 2)?;
```

above

```
self.group_values
    .intern(group_values, &mut self.current_group_indices)?;
```
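The resizing behavior described under "Expected behavior" can be illustrated with a minimal, standalone sketch. It does not use DataFusion: `ToyReservation` and `push_checked` are hypothetical stand-ins invented for this example, not the real `MemoryReservation` API. The standard library does not guarantee a particular growth factor for `Vec`; the doubling shown here is what current implementations do in practice.

```rust
/// Fill a `Vec<u8>` to its current capacity, push one more byte, and
/// return the capacity before and after that single push.
fn capacity_after_one_more_push(initial: usize) -> (usize, usize) {
    let mut buf: Vec<u8> = Vec::with_capacity(initial);
    // Fill up to the real capacity, which may exceed `initial`.
    let cap = buf.capacity();
    buf.resize(cap, 0);
    let before = buf.capacity();
    buf.push(0); // the buffer is full, so this forces a reallocation
    (before, buf.capacity())
}

/// Hypothetical stand-in for a memory reservation with a fixed limit.
/// This is NOT DataFusion's `MemoryReservation`; it only models the idea.
struct ToyReservation {
    size: usize,
    limit: usize,
}

impl ToyReservation {
    /// Set the reserved size, failing if it would exceed the limit.
    fn try_resize(&mut self, new_size: usize) -> Result<(), String> {
        if new_size > self.limit {
            return Err(format!(
                "cannot reserve {} bytes, limit is {}",
                new_size, self.limit
            ));
        }
        self.size = new_size;
        Ok(())
    }
}

/// Push one byte, but account for a possible doubling *before* it happens.
/// Assumption: a full buffer may grow to twice its capacity on the next push.
fn push_checked(
    buf: &mut Vec<u8>,
    res: &mut ToyReservation,
    byte: u8,
) -> Result<(), String> {
    if buf.len() == buf.capacity() {
        let worst_case = buf.capacity().max(1) * 2;
        res.try_resize(worst_case)?;
    }
    buf.push(byte);
    Ok(())
}
```

Calling `capacity_after_one_more_push(1_000_000)` shows the capacity jumping well past 1,000,001 after a single push. `push_checked` returns an error instead of aborting when the doubled size would exceed the limit, which is the point at which a real operator could spill rather than allocate.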