[ https://issues.apache.org/jira/browse/ARROW-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580473#comment-17580473 ]

Weston Pace commented on ARROW-17441:
-------------------------------------

My suspicion would be that pa.total_allocated_bytes() would be 0 (as @pitrou 
said, we are not using the Arrow memory pools here) and that the remaining 
500MB is fragmented data left over in the system allocator.
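
A quick way to check that split is to print what the Arrow pool reports next 
to the process RSS after the delete. Below is a minimal sketch (the table 
built here is illustrative, not the original repro): if 
pa.total_allocated_bytes() comes back at or near zero while RSS stays high, 
the leftover memory is sitting in the underlying allocator rather than in an 
Arrow pool.

{code:python}
import gc
import os

import psutil
import pyarrow as pa

process = psutil.Process(os.getpid())

# Allocate and drop a large table, then ask the pool to give memory back.
tab = pa.table({"x": list(range(10_000_000))})
del tab
gc.collect()
pa.default_memory_pool().release_unused()

# Near-zero Arrow-tracked bytes alongside a high RSS points at the system
# allocator, not the Arrow pool.
print(f"arrow-tracked: {pa.total_allocated_bytes():,} bytes")
print(f"process RSS:   {process.memory_info().rss:,} bytes")
{code}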

That being said, it does raise the question of why the system allocator isn't 
able to release the memory.  In other words, what's making up the fragments 
occupying those pages?  I'd guess it's some kind of Python object, but it 
could possibly be some kind of global.  It's also possible that the system 
allocator simply doesn't try that hard to return freed pages to the OS.
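
One way to test that last hypothesis on Linux with glibc (a sketch; 
malloc_trim is glibc-specific and won't exist on other libcs or on 
macOS/Windows) is to ask the allocator directly to hand back free pages:

{code:python}
import ctypes

# malloc_trim(0) asks glibc's malloc to return any releasable free pages to
# the OS. If RSS drops after this call, the allocator was merely holding on
# to freed pages; if RSS stays flat, live fragments really do occupy them.
libc = ctypes.CDLL("libc.so.6")
libc.malloc_trim(0)
{code}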

> [Python] Memory kept after del and pool.release_unused()
> --------------------------------------------------------
>
>                 Key: ARROW-17441
>                 URL: https://issues.apache.org/jira/browse/ARROW-17441
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 9.0.0
>            Reporter: Will Jones
>            Priority: Major
>
> I was trying to reproduce another issue involving memory pools not releasing 
> memory, but encountered this confusing behavior: if I create a table, then 
> call {{del tab}}, and then {{pool.release_unused()}}, I still see 
> significant memory usage. With mimalloc in particular, I see no meaningful 
> drop in memory usage on either call.
> Am I missing something? My prior understanding was that memory will be 
> held onto by a memory pool but will be force-freed by {{release_unused()}}, 
> and that the system memory pool should release memory immediately. But 
> neither of those seems true.
> {code:python}
> import os
> import psutil
> import time
> import gc
> process = psutil.Process(os.getpid())
> import numpy as np
> from uuid import uuid4
> import pyarrow as pa
> def gen_batches(n_groups=200, rows_per_group=200_000):
>     for _ in range(n_groups):
>         id_val = uuid4().bytes
>         yield pa.table({
>             "x": np.random.random(rows_per_group),  # This will compress poorly
>             "y": np.random.random(rows_per_group),
>             "a": pa.array(list(range(rows_per_group)), type=pa.int32()),  # This compresses with delta encoding
>             "id": pa.array([id_val] * rows_per_group),  # This compresses with RLE
>         })
> def print_rss():
>     print(f"RSS: {process.memory_info().rss:,} bytes")
> print(f"memory_pool={pa.default_memory_pool().backend_name}")
> print_rss()
> print("reading table")
> tab = pa.concat_tables(list(gen_batches()))
> print_rss()
> print("deleting table")
> del tab
> gc.collect()
> print_rss()
> print("releasing unused memory")
> pa.default_memory_pool().release_unused()
> print_rss()
> print("waiting 10 seconds")
> time.sleep(10)
> print_rss()
> {code}
> {code:none}
> > ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool.py && \
>     ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool.py && \
>     ARROW_DEFAULT_MEMORY_POOL=system python test_pool.py
> memory_pool=mimalloc
> RSS: 44,449,792 bytes
> reading table
> RSS: 1,819,557,888 bytes
> deleting table
> RSS: 1,819,590,656 bytes
> releasing unused memory
> RSS: 1,819,852,800 bytes
> waiting 10 seconds
> RSS: 1,819,852,800 bytes
> memory_pool=jemalloc
> RSS: 45,629,440 bytes
> reading table
> RSS: 1,668,677,632 bytes
> deleting table
> RSS: 698,400,768 bytes
> releasing unused memory
> RSS: 699,023,360 bytes
> waiting 10 seconds
> RSS: 699,023,360 bytes
> memory_pool=system
> RSS: 44,875,776 bytes
> reading table
> RSS: 1,713,569,792 bytes
> deleting table
> RSS: 540,311,552 bytes
> releasing unused memory
> RSS: 540,311,552 bytes
> waiting 10 seconds
> RSS: 540,311,552 bytes
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
