pitrou commented on issue #45287: URL: https://github.com/apache/arrow/issues/45287#issuecomment-2612692963
> * **After the read is done (i.e. the Jupyter notebook cell running the read completes):** the memory usage still hasn't decreased

Ok, I don't know how Jupyter works in that regard, but I know that the IPython console (on the command line) keeps past results alive by default. See [%reset](https://ipythonbook.com/magic/reset.html).

> I ran the repro on your patch commit as well (https://github.com/apache/arrow/issues/37630) and memory usage is a quarter of what it was without the patch!

Great, thank you!

> However, I think we're still left with the general issue that memory usage is significantly higher than the amount of "real data" loaded (GBs of memory usage for MBs of real data) -- it seems like something is still accumulating?

That might also have to do with how memory allocators work: they often keep a cache of deallocated memory for better performance instead of returning it to the OS. There are several things that you could try and report results for:

* selecting different [memory pool implementations](https://arrow.apache.org/docs/cpp/env_vars.html#envvar-ARROW_DEFAULT_MEMORY_POOL): jemalloc, mimalloc, system
* trying to [release memory more forcibly](https://arrow.apache.org/docs/cpp/api/memory.html#_CPPv4N5arrow10MemoryPool13ReleaseUnusedEv): this is not recommended in production cases (because it makes later allocations more expensive), but can be used for experiments like this to find out the possible cause of memory consumption
