ales-vilchytski commented on issue #18431: URL: https://github.com/apache/arrow/issues/18431#issuecomment-1964637156
Encountered what is highly likely the same issue.

Our use case:
- dagster, k8s, pods have a 24 GB memory limit
- the job reads a dataset of 50-100 large files (around 200 MB each) one by one, processes each file with pandas, then writes it back (so 1-2 files at a time are loaded into memory)
- our Parquet files contain few columns, but one column is a fairly large JSON string (300+ KB)
- the job usually ends with an OOM kill

I can't provide our data or code, but I created a repository with the smallest possible scripts to reproduce the issue: [https://github.com/ales-vilchytski/pyarrow-parquet-memory-leak-demo](https://github.com/ales-vilchytski/pyarrow-parquet-memory-leak-demo). It includes scripts to generate the Parquet file and to reproduce the OOM, a Dockerfile, and instructions for running it.

The issue reproduces on pyarrow 13 and 14 with pandas 2+, on different Docker images, on native macOS ARM 13, and on different Python versions (3.10, 3.11, 3.12).

The core loop ([https://github.com/ales-vilchytski/pyarrow-parquet-memory-leak-demo/blob/main/src/mem_leak.py#L10](https://github.com/ales-vilchytski/pyarrow-parquet-memory-leak-demo/blob/main/src/mem_leak.py#L10)):

```python
import time

import pyarrow as pa
import pyarrow.dataset as ds

c = 0
while True:
    start = time.time()
    data = ds.dataset('../data/example.parquet')  # parquet file with large strings
    df = data.to_table().to_pandas()
    pa.Table.from_pandas(df)
    end = time.time()
    print(f'iteration {c}, time {end - start}s')
    c += 1
```

As an example: with a 12 GB memory limit the script iterates about 5 times before getting killed by the OOM killer (Docker, WSL2 Ubuntu 22.04 with 16 GB memory).

I also experimented with jemalloc settings and found that `JE_ARROW_MALLOC_CONF=abort_conf:true,confirm_conf:true,retain:false,background_thread:true,dirty_decay_ms:0,muzzy_decay_ms:0,lg_extent_max_active_fit:2` works a bit better.
The Parquet file in the example is written with `object` dtypes by default ([https://github.com/ales-vilchytski/pyarrow-parquet-memory-leak-demo/blob/main/src/gen_parquet.py](https://github.com/ales-vilchytski/pyarrow-parquet-memory-leak-demo/blob/main/src/gen_parquet.py)), but writing `string` explicitly only delays the OOM slightly. Every attempt to fix this by triggering GC, clearing memory pools, or switching to the system memory allocator failed. The process still gets OOM-killed, just earlier or later.
