ales-vilchytski commented on issue #18431: URL: https://github.com/apache/arrow/issues/18431#issuecomment-1964637156
Encountered what is highly likely the same issue.

Our use case:
- dagster, k8s, pods have a 24 GB memory limit
- the job reads a dataset of 50-100 large files (around 200 MB each) one by one, processes each file with pandas, then writes it back (so 1-2 files at a time are loaded into memory)
- our Parquet files contain few columns, but one column is a fairly large JSON string (300+ KB)
- the job usually ends with an OOM kill

I can't provide our data or code, but I created a repository with the smallest possible scripts to reproduce the issue: [https://github.com/ales-vilchytski/pyarrow-parquet-memory-leak-demo](https://github.com/ales-vilchytski/pyarrow-parquet-memory-leak-demo). It includes scripts to generate the Parquet file and to reproduce the OOM, a Dockerfile, and instructions for running it.

The issue reproduces on pyarrow 13 and 14 with pandas 2+, on different Docker images, on native macOS ARM 13, and on different Python versions (3.10, 3.11, 3.12).

The core loop ([https://github.com/ales-vilchytski/pyarrow-parquet-memory-leak-demo/blob/main/src/mem_leak.py#L10](https://github.com/ales-vilchytski/pyarrow-parquet-memory-leak-demo/blob/main/src/mem_leak.py#L10)):

```python
import time

import pyarrow as pa
import pyarrow.dataset as ds

c = 0
while True:
    start = time.time()
    data = ds.dataset('../data/example.parquet')  # parquet file with large strings
    df = data.to_table().to_pandas()
    pa.Table.from_pandas(df)
    end = time.time()
    print(f'iteration {c}, time {end - start}s')
    c += 1
```

As an example: with a 12 GB memory limit the script iterates about 5 times before getting killed by the OOM killer (Docker, WSL2 Ubuntu 22.04 with 16 GB memory).

I also experimented with jemalloc settings and found that `JE_ARROW_MALLOC_CONF=abort_conf:true,confirm_conf:true,retain:false,background_thread:true,dirty_decay_ms:0,muzzy_decay_ms:0,lg_extent_max_active_fit:2` works a bit better.
The Parquet file in the example is written with `object` dtypes by default ([https://github.com/ales-vilchytski/pyarrow-parquet-memory-leak-demo/blob/main/src/gen_parquet.py](https://github.com/ales-vilchytski/pyarrow-parquet-memory-leak-demo/blob/main/src/gen_parquet.py)), but writing `string` explicitly only delays the OOM slightly. Every attempt to fix this by triggering GC, clearing memory pools, or switching to the system memory allocator failed. The process still gets OOM-killed, just earlier or later.
