timothydijamco commented on issue #45287: URL: https://github.com/apache/arrow/issues/45287#issuecomment-2619999594
> > However, I think we're still left with the general issue that memory usage is significantly higher than the amount of "real data" loaded (GBs of memory usage for MBs of real data) -- it seems like something is still accumulating?
>
> That might also have to do with how memory allocators work (they often keep some cache of deallocated memory for better performance instead of returning it to the OS). There are several things that you could try and report results for:
>
> * selecting different [memory pool implementations](https://arrow.apache.org/docs/cpp/env_vars.html#envvar-ARROW_DEFAULT_MEMORY_POOL): jemalloc, mimalloc, system
> * trying to [release memory more forcibly](https://arrow.apache.org/docs/cpp/api/memory.html#_CPPv4N5arrow10MemoryPool13ReleaseUnusedEv): this is not recommended in production cases (because this makes later allocations more expensive), but can be used for experiments like this to find out the possible cause of memory consumption

I printed out info about the default memory pool after each batch is read (read from the `RecordBatchReader` I created from the `Scanner`):

* `total_bytes_allocated` steadily increases over time, which makes sense
* `bytes_allocated` fluctuates but stays capped, i.e. it does not track the overall memory usage of the process, which keeps increasing steadily over time

Calling `arrow::default_memory_pool()->ReleaseUnused()` after every record batch is read also seems to have no effect.

My shaky understanding of Arrow memory pools and allocators says this means the memory usage I'm hoping to reduce is memory that is not allocated on the Arrow memory pool?
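For reference, here is roughly what my instrumentation loop looks like (a minimal sketch; the helper name is mine, and I'm assuming the reader came from `Scanner::ToRecordBatchReader()`):

```cpp
#include <iostream>
#include <memory>

#include <arrow/memory_pool.h>
#include <arrow/record_batch.h>
#include <arrow/status.h>

// Illustrative helper (the name is mine). `reader` is assumed to be the
// arrow::RecordBatchReader produced from the Dataset Scanner, e.g. via
// Scanner::ToRecordBatchReader().
arrow::Status ReadAndReportPoolStats(arrow::RecordBatchReader* reader) {
  arrow::MemoryPool* pool = arrow::default_memory_pool();

  // Shows which allocator is active, e.g. after setting
  // ARROW_DEFAULT_MEMORY_POOL=jemalloc|mimalloc|system in the environment.
  std::cerr << "pool backend: " << pool->backend_name() << "\n";

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream

    // Experiment only: ask the pool to return cached memory to the OS.
    pool->ReleaseUnused();

    std::cerr << "bytes_allocated=" << pool->bytes_allocated()
              << " total_bytes_allocated=" << pool->total_bytes_allocated()
              << "\n";
  }
  return arrow::Status::OK();
}
```

(`backend_name()` is just there to confirm which allocator the `ARROW_DEFAULT_MEMORY_POOL` environment variable actually selected.)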
