Weston Pace created ARROW-12519:
-----------------------------------

             Summary: [C++] Create/document better characterization of jemalloc/mimalloc
                 Key: ARROW-12519
                 URL: https://issues.apache.org/jira/browse/ARROW-12519
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


The following script reads a large dataset 10 times in a loop.  The dataset 
comes from the Ursa benchmarks ([https://github.com/ursacomputing/benchmarks]), 
but any sufficiently large dataset should work.  This one is ~5-6 GB when 
deserialized into an Arrow table, and since the conversion to a dataframe is 
not zero-copy, each iteration requires about 8.6 GB.

Running the looped version with mimalloc consumes 27 GB of RAM.  This is 
fairly deterministic; even putting a 1 second sleep between iterations yields 
the same result.  On the other hand, if I put the read into its own method 
(second version below) then it uses only 14 GB.  Presumably the method version 
drops its references to the table and dataframe before the next read begins, 
while the inline loop keeps the previous iteration's objects alive until the 
names are rebound, which raises the peak working set.

Our current rule of thumb seems to be "as long as the allocators stabilize at 
some number at some point then it is not a bug", so technically both 27 GB and 
14 GB are valid.

If we can't put any kind of bound whatsoever on the RAM that Arrow needs then 
it will eventually become a problem for adoption.  I think we need to develop 
some characterization of how much mimalloc/jemalloc should be allowed to 
over-allocate before we consider it a bug and require changing the code to 
avoid the situation (or documenting that certain operations are not valid).
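
As a very rough sketch of what such a characterization check could look like 
(the 2x threshold below is an arbitrary placeholder, not a proposed limit), we 
could compare the process RSS against what Arrow reports as allocated:

{code:python}
import os

import psutil
import pyarrow as pa


def overallocation_factor():
    # Ratio of process RSS to Arrow-reported allocations.  Only an
    # approximation: RSS also covers the interpreter, pandas data, etc.,
    # so this overstates what the allocator itself is holding on to.
    rss = psutil.Process(os.getpid()).memory_info().rss
    allocated = pa.total_allocated_bytes()
    return rss / allocated if allocated else float('inf')


# Hypothetical threshold; the point of this issue is deciding what
# value (if any) we can actually promise.
assert overallocation_factor() < 2.0
{code}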

 

----CODE----

// First version (uses ~27GB)
{code:python}
import time
import pyarrow as pa
import pyarrow.parquet as pq
import psutil
import os

pa.set_memory_pool(pa.mimalloc_memory_pool())
print(pa.default_memory_pool().backend_name)

for _ in range(10):
    table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
    df = table.to_pandas()
    print(pa.total_allocated_bytes())
    proc = psutil.Process(os.getpid())
    print(proc.memory_info())
{code}
// Second version (uses ~14GB)
{code:python}
import time
import pyarrow as pa
import pyarrow.parquet as pq
import psutil
import os

pa.set_memory_pool(pa.mimalloc_memory_pool())
print(pa.default_memory_pool().backend_name)

def bm():
    table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
    df = table.to_pandas()
    print(pa.total_allocated_bytes())
    proc = psutil.Process(os.getpid())
    print(proc.memory_info())

for _ in range(10):
    bm()

{code}
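
If it helps narrow down whether the pool is simply holding on to freed memory, 
a third variant (untested here; release_unused() may not be available in all 
pyarrow versions) could explicitly ask the pool to return memory between 
iterations:

// Third variant (sketch): ask the pool to return unused memory
{code:python}
import os

import psutil
import pyarrow as pa
import pyarrow.parquet as pq

pa.set_memory_pool(pa.mimalloc_memory_pool())
print(pa.default_memory_pool().backend_name)

def bm():
    table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
    df = table.to_pandas()
    print(pa.total_allocated_bytes())
    proc = psutil.Process(os.getpid())
    print(proc.memory_info())

for _ in range(10):
    bm()
    # Ask the pool to hand freed-but-retained memory back to the OS;
    # availability of release_unused() depends on the pyarrow version.
    pa.default_memory_pool().release_unused()
{code}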



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
