jonded94 commented on issue #44599:
URL: https://github.com/apache/arrow/issues/44599#issuecomment-2476922797

   > which platform are you running this on?
   
   ```
   $ uname -a
    6.6.30-flatcar #1 SMP PREEMPT_DYNAMIC Sun May 19 16:12:26 -00 2024 x86_64 GNU/Linux
   $ python -c "import pyarrow; print(pyarrow.__version__)"
   17.0.0
   ```
   > did you try changing the default memory pool? 
   
   "system" memory pool:
   ```
   $ ARROW_DEFAULT_MEMORY_POOL=system python scripts/read_metadata.py repartition/* 3
   load: repartition/part-332.parquet
    took 0.00124 s, mem diff 1.500MiB [start: 194.336MiB, end: 195.836MiB]0, 0
   load: repartition/_metadata
    took 2.41739 s, mem diff 2077.484MiB [start: 195.836MiB, end: 2273.320MiB]0, 0
   load: repartition/_metadata
    took 2.25331 s, mem diff 400.500MiB [start: 2273.320MiB, end: 2673.820MiB]0, 0
   load: repartition/part-759.parquet
    took 0.00094 s, mem diff 1.500MiB [start: 2673.820MiB, end: 2675.320MiB]0, 0
   ```
   => also a ~2.3GiB leak at the *first* read, and a second read even increases it to ~2.7GiB.
   
   "jemalloc" memory pool:
   ```
   $ ARROW_DEFAULT_MEMORY_POOL=jemalloc python scripts/read_metadata.py repartition/* 3
   load: repartition/part-58.parquet
    took 0.00122 s, mem diff 1.500MiB [start: 198.316MiB, end: 199.816MiB]0, 0
   load: repartition/_metadata
    took 2.40700 s, mem diff 2081.992MiB [start: 199.816MiB, end: 2281.809MiB]0, 0
   load: repartition/part-40.parquet
    took 0.00101 s, mem diff 1.500MiB [start: 2281.809MiB, end: 2283.309MiB]0, 0
   load: repartition/_metadata
    took 2.19111 s, mem diff 17.043MiB [start: 2283.309MiB, end: 2300.352MiB]0, 0
   gc:   repartition/part-912.parquet
    took 0.00223 s, mem diff -2.000MiB [start: 2300.352MiB, end: 2298.352MiB]0, 0
   load: repartition/_metadata
    took 2.16370 s, mem diff 0.629MiB [start: 2298.352MiB, end: 2298.980MiB]0, 0
   ```
   "mimalloc" memory pool:
   ```
   $ ARROW_DEFAULT_MEMORY_POOL=mimalloc python scripts/read_metadata.py repartition/* 3
   load: repartition/part-887.parquet
    took 0.00150 s, mem diff 5.285MiB [start: 189.285MiB, end: 194.570MiB]0, 0
   load: repartition/_metadata
    took 2.40356 s, mem diff 2078.855MiB [start: 194.570MiB, end: 2273.426MiB]0, 0
   load: repartition/_metadata
    took 2.25088 s, mem diff 15.887MiB [start: 2273.426MiB, end: 2289.312MiB]0, 0
   load: repartition/_metadata
    took 2.23642 s, mem diff -0.391MiB [start: 2289.312MiB, end: 2288.922MiB]0, 0
   ```
   
   Then I tweaked the statement where I previously just called 
`gc.collect()`, adding an explicit memory pool cleanup:
   ```
           with profiling(f"gc:   {path}"):
               pyarrow.default_memory_pool().release_unused()
               gc.collect()
   ```
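   For reference, `profiling` in the snippet above is just a small helper from my script (not shown here). A minimal sketch of such a context manager, assuming RSS is read from `/proc/self/statm` (Linux-only, with a rough `getrusage` fallback elsewhere), could look like:
   ```python
   import os
   import time
   from contextlib import contextmanager
   
   def rss_mib():
       # Current resident set size in MiB. /proc/self/statm reports sizes
       # in pages; field 1 is the resident page count (Linux-specific).
       try:
           with open("/proc/self/statm") as f:
               resident_pages = int(f.read().split()[1])
           return resident_pages * os.sysconf("SC_PAGE_SIZE") / 2**20
       except FileNotFoundError:
           # Rough fallback for non-Linux platforms: peak (not current) RSS.
           # Note ru_maxrss is KiB on Linux but bytes on macOS.
           import resource
           return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
   
   @contextmanager
   def profiling(label):
       # Print wall time and RSS delta for the enclosed block.
       mem_before = rss_mib()
       t0 = time.monotonic()
       try:
           yield
       finally:
           took = time.monotonic() - t0
           mem_after = rss_mib()
           print(f"{label}\n took {took:.5f} s, "
                 f"mem diff {mem_after - mem_before:.3f}MiB "
                 f"[start: {mem_before:.3f}MiB, end: {mem_after:.3f}MiB]")
   ```
   Note that RSS measured this way includes allocator caches, which is exactly why the different memory pools report such different "leaks" above.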
   
   The following data points resulted from this tuned version of the script.
   
   "system" memory pool:
   ```
   $ ARROW_DEFAULT_MEMORY_POOL=system python scripts/read_metadata.py repartition/* 3
   load: repartition/part-419.parquet
    took 0.00142 s, mem diff 3.000MiB [start: 192.777MiB, end: 195.777MiB]0, 0
   gc:   repartition/part-419.parquet
    took 0.00343 s, mem diff -1.695MiB [start: 195.777MiB, end: 194.082MiB]0, 0
   load: repartition/_metadata
    took 2.40818 s, mem diff 2080.812MiB [start: 194.082MiB, end: 2274.895MiB]0, 0
   gc:   repartition/_metadata
    took 0.17570 s, mem diff -2079.277MiB [start: 2274.895MiB, end: 195.617MiB]0, 0
   load: repartition/_metadata
    took 3.15845 s, mem diff 2476.500MiB [start: 195.617MiB, end: 2672.117MiB]0, 0
   gc:   repartition/_metadata
    took 0.20289 s, mem diff -2481.230MiB [start: 2672.117MiB, end: 190.887MiB]0, 0
   load: repartition/_metadata
    took 3.10904 s, mem diff 2479.500MiB [start: 190.887MiB, end: 2670.387MiB]0, 0
   gc:   repartition/_metadata
    took 0.18726 s, mem diff -2479.953MiB [start: 2670.387MiB, end: 190.434MiB]0, 0
   load: repartition/part-1549.parquet
    took 0.00179 s, mem diff 1.500MiB [start: 190.434MiB, end: 191.934MiB]0, 0
   ```
   => Here, the memory *actually* seems to be released back to the system!
   
   "jemalloc" memory pool:
   ```
   $ ARROW_DEFAULT_MEMORY_POOL=jemalloc python scripts/read_metadata.py repartition/* 3
   load: repartition/part-426.parquet
    took 0.00138 s, mem diff 3.000MiB [start: 193.805MiB, end: 196.805MiB]0, 0
   gc:   repartition/part-426.parquet
    took 0.00635 s, mem diff 26.812MiB [start: 196.805MiB, end: 223.617MiB]0, 0
   load: repartition/_metadata
    took 2.48826 s, mem diff 2082.836MiB [start: 223.617MiB, end: 2306.453MiB]0, 0
   load: repartition/_metadata
    took 2.23224 s, mem diff 16.172MiB [start: 2306.453MiB, end: 2322.625MiB]0, 0
   gc:   repartition/part-267.parquet
    took 0.00250 s, mem diff -1.762MiB [start: 2322.625MiB, end: 2320.863MiB]0, 0
   gc:   repartition/part-839.parquet
    took 0.00231 s, mem diff -1.988MiB [start: 2320.863MiB, end: 2318.875MiB]0, 0
   load: repartition/_metadata
    took 2.18774 s, mem diff 0.621MiB [start: 2318.875MiB, end: 2319.496MiB]0, 0
   ```
   => The jemalloc memory pool apparently ignores the explicit cleanup call.
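   One knob that might be worth trying for the jemalloc case: pyarrow exposes jemalloc's background decay time, which controls how eagerly freed pages are handed back to the OS. A hedged sketch (the exact effect depends on how pyarrow was built; the call raises `NotImplementedError` on builds without jemalloc):
   ```python
   import pyarrow
   
   # jemalloc keeps freed pages around for `decay_ms` before returning
   # them to the OS; 0 requests immediate release, -1 disables
   # time-based decay entirely.
   try:
       pyarrow.jemalloc_set_decay_ms(0)
   except NotImplementedError:
       print("this pyarrow build does not include jemalloc")
   
   pool = pyarrow.default_memory_pool()
   print(pool.backend_name, pool.bytes_allocated())
   ```
   With `decay_ms=0`, `release_unused()` might not even be needed for jemalloc, since dirty pages are purged as soon as they are freed (at some throughput cost).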
   
   "mimalloc" memory pool (this was increasing & releasing memory on every 
iteration and spamming stdout, I had to limit output to cases where memory 
changed by more than 5MiB):
   ```
   $ ARROW_DEFAULT_MEMORY_POOL=mimalloc python scripts/read_metadata.py repartition/* 3
   load: repartition/_metadata
    took 2.37767 s, mem diff 2079.824MiB [start: 193.172MiB, end: 2272.996MiB]0, 0
   load: repartition/_metadata
    took 2.14551 s, mem diff 17.660MiB [start: 2271.785MiB, end: 2289.445MiB]0, 0
   ```
   => The memory leak still appears here as well.

