Weston Pace created ARROW-12519:
-----------------------------------

Summary: [C++] Create/document better characterization of jemalloc/mimalloc
Key: ARROW-12519
URL: https://issues.apache.org/jira/browse/ARROW-12519
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace

The following script reads in a large dataset 10 times in a loop. The dataset is from the Ursa benchmarks (https://github.com/ursacomputing/benchmarks). However, any sufficiently large dataset should do. The dataset is ~5-6 GB when deserialized into an Arrow table. The conversion to a dataframe is not zero-copy, so the loop requires about 8.6 GB.

Running the loop 10 times with mimalloc consumes 27 GB of RAM. It is pretty deterministic: even putting a 1 second sleep between iterations yields the same result. On the other hand, if I put the read into its own function (second version below), it uses only 14 GB.

Our current rule of thumb seems to be "as long as the allocators stabilize to some number at some point then it is not a bug", so technically both 27 GB and 14 GB are valid.

If we can't put any kind of bound whatsoever on the RAM that Arrow needs, it will eventually become a problem for adoption. I think we need to develop some sort of characterization of how much mimalloc/jemalloc should be allowed to over-allocate before we consider it a bug and require changing the code to avoid the situation (or documenting that certain operations are not valid).

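As a rough illustration of the kind of characterization I have in mind (separate from the reproduction scripts themselves), one option is to express peak resident memory as a multiple of the memory the workload actually needs and compare it against an agreed bound. The sketch below is only that: the helper names and the `MAX_FACTOR` threshold of 2.0 are hypothetical and not an agreed project value. The figures plugged in are the 27 GB, 14 GB and 8.6 GB numbers reported above.

```python
from typing import Callable, List


def overallocation_factor(peak_rss_bytes: float, required_bytes: float) -> float:
    """Peak resident memory divided by the memory the workload actually needs."""
    if required_bytes <= 0:
        raise ValueError("required_bytes must be positive")
    return peak_rss_bytes / required_bytes


def sample_rss(workload: Callable[[], None],
               read_rss: Callable[[], int],
               iterations: int = 10) -> List[int]:
    """Run the workload repeatedly and record RSS after each iteration."""
    samples = []
    for _ in range(iterations):
        workload()
        samples.append(read_rss())
    return samples


def within_bound(samples: List[int], required_bytes: float, max_factor: float) -> bool:
    """True if the worst sample stays at or under max_factor times the required memory."""
    return overallocation_factor(max(samples), required_bytes) <= max_factor


GB = 1024 ** 3
MAX_FACTOR = 2.0  # hypothetical threshold, chosen only for illustration

# Figures reported in this issue: ~8.6 GB required, 27 GB vs 14 GB observed.
print(round(overallocation_factor(27 * GB, 8.6 * GB), 2))  # 3.14
print(round(overallocation_factor(14 * GB, 8.6 * GB), 2))  # 1.63
```

To drive this from a real run, `read_rss` could be `lambda: psutil.Process(os.getpid()).memory_info().rss` and `workload` the read-and-convert step from the scripts. Under the illustrative bound of 2.0, the first version would be flagged and the second would pass.
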
// First version (uses ~27GB)
{code:java}
import time
import pyarrow as pa
import pyarrow.parquet as pq
import psutil
import os

pa.set_memory_pool(pa.mimalloc_memory_pool())
print(pa.default_memory_pool().backend_name)

for _ in range(10):
    table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
    df = table.to_pandas()
    print(pa.total_allocated_bytes())
    proc = psutil.Process(os.getpid())
    print(proc.memory_info())
{code}

// Second version (uses ~14GB)
{code:java}
import time
import pyarrow as pa
import pyarrow.parquet as pq
import psutil
import os

pa.set_memory_pool(pa.mimalloc_memory_pool())
print(pa.default_memory_pool().backend_name)

def bm():
    table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
    df = table.to_pandas()
    print(pa.total_allocated_bytes())
    proc = psutil.Process(os.getpid())
    print(proc.memory_info())

for _ in range(10):
    bm()
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)