westonpace commented on issue #36100:
URL: https://github.com/apache/arrow/issues/36100#issuecomment-1599665149

   I'm not entirely sure what you are expecting, and I don't think `from_pylist` 
is to blame.  Debugging memory usage like this is pretty complex.  First, just 
importing the code and running it for the first time is going to load 
various shared objects, etc. into RSS.  To factor out this effect we can seed 
the program by creating a single row first.  I'm also going to run 
`construct_data` three times in a row.
   
   
![image](https://github.com/apache/arrow/assets/1696093/b57a5fa7-52b5-4d13-8ee6-1ab4adc589af)
   
   Now we see that `from_pylist` (the zig-zag parts) doesn't use much 
additional RAM, but it also doesn't appear to release it immediately.  In fact, 
it could almost look like it is leaking a bit of memory each time.  However, if 
we run `construct_data` more times we can see that it isn't actually leaking.
   
   What is happening is that pyarrow is returning the memory back to the 
allocator (in these graphs I was using the system allocator so we are returning 
the memory to `malloc`).  However, the allocator is not releasing this memory 
to the OS.  This is because obtaining memory from the OS is expensive and so 
the allocator tries to avoid it if it can.
   
   
![image](https://github.com/apache/arrow/assets/1696093/f71ccba0-4f1f-4aea-b9f0-34ed2637f4cc)
   
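   We can verify this by printing `pa.total_allocated_bytes()`.  This tells us 
how much memory pyarrow currently has allocated from its memory pool, i.e. 
memory it has not yet returned to the allocator.  We can see that this is 
always 0 once `construct_data` has finished.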
   
   Finally, there is a method we can call, mainly for debugging purposes, to 
ask the allocator to return the memory to the OS.  What this actually does 
under the hood depends on the allocator.  For malloc, this triggers a call to 
[`malloc_trim`](https://man7.org/linux/man-pages/man3/malloc_trim.3.html).
   
   If we use `release_unused` then we see the RAM is returned to the OS after 
each call to `construct_data`.
   
   
![image](https://github.com/apache/arrow/assets/1696093/f61a9cdb-47c7-4a81-817b-af08dbd6e435)
   
   Updated example demonstrating some of these things.
   
   ```
   import pyarrow as pa
   import time
   import random
   import string
   
   def get_sample_data():
       record1 = {}
       for col_id in range(15):
           record1[f"column_{col_id}"] = string.ascii_letters[10 : random.randint(17, 49)]
   
       return [record1]
   
   def construct_data(data, size):
       count = 1
       while count < 10:
           pa.Table.from_pylist(data * size)
           count += 1
       return True
   
   def main():
       data = get_sample_data()
       construct_data(data, 1)
       print(f"initial seeding complete! total_allocated_bytes={pa.total_allocated_bytes()}")
       time.sleep(10)
       construct_data(data, 100000)
       pa.default_memory_pool().release_unused()
       print(f"construct data completed! total_allocated_bytes={pa.total_allocated_bytes()}")
       time.sleep(10)
       construct_data(data, 100000)
       pa.default_memory_pool().release_unused()
       print(f"construct data completed! total_allocated_bytes={pa.total_allocated_bytes()}")
       time.sleep(10)
       construct_data(data, 100000)
       pa.default_memory_pool().release_unused()
       print(f"construct data completed! total_allocated_bytes={pa.total_allocated_bytes()}")
   
   if __name__ == "__main__":
       main()
   ```

