jorisvandenbossche opened a new issue, #36378:
URL: https://github.com/apache/arrow/issues/36378

   We regularly get reports about potential memory usage issues (memory leaks, 
etc.), and we often need to clarify expectations about what is being observed or 
give hints on how to explore the memory usage. Since this comes up regularly, 
it would be useful to gather this content on a single page that we can point 
to, instead of repeating it every time.
   
   (recent examples: https://github.com/apache/arrow/issues/36100, 
https://github.com/apache/arrow/issues/36101)
   
   Some aspects that could be useful to mention on such a page:
   
   - Some basic background on memory allocation. Of course we can't provide a 
full tutorial on this, but a few facts might help set expectations. For example 
this quote from Weston in a recent issue 
(https://github.com/apache/arrow/issues/36100#issuecomment-1599665149) to 
explain why memory usage stays high:
   
     > What is happening is that pyarrow is returning the memory back to the 
allocator (in these graphs I was using the system allocator so we are returning 
the memory to `malloc`). However, the allocator is not releasing this memory to 
the OS. This is because obtaining memory from the OS is expensive and so the 
allocator tries to avoid it if it can.
   
     Or the similar comment from Antoine in 
https://github.com/apache/arrow/issues/18431#issuecomment-1377645723
   
   - List the functionality in `pyarrow` that can help diagnose or verify 
memory usage: `pa.total_allocated_bytes()`, `release_unused`, ...
   
   - More advanced: mention that there are different memory pool 
implementations, so users can also try a different one. Each memory pool 
may also have options to set (e.g. `pa.jemalloc_set_decay_ms(0)`)
   
   - General tips and tricks (e.g. run your reproducer several times in a row: 
if memory usage stops increasing after the first run, it is most likely not a 
memory leak)
   
   - Potentially mention some external tools that can help (e.g. 
[`memray`](https://github.com/bloomberg/memray/))
   
   Other things we could add?
   
   cc @westonpace @pitrou @AlenkaF @anjakefala 


