[ https://issues.apache.org/jira/browse/ARROW-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330472#comment-17330472 ]
Jonathan Keane commented on ARROW-12519:
----------------------------------------

I've also attached some more recent R benchmarks (the 7 April HEAD uses mimalloc 2.0 and the 6 April build uses 1.6; we are now on 1.6 because we saw regressions in the C++ microbenchmarks with 2.0). We see the same increase in duration for fanniemae here, but not for the other datasets (well, except with 3.0, which used jemalloc, but that is the same issue as noted above). Oddly, the files that contain a single data type don't seem to exhibit this pattern anywhere (though those files are much smaller and quicker to read, so maybe this only impacts larger files somehow?).

> [C++] Create/document better characterization of jemalloc/mimalloc
> ------------------------------------------------------------------
>
>                 Key: ARROW-12519
>                 URL: https://issues.apache.org/jira/browse/ARROW-12519
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>         Attachments: csv-uncompressed-8core.png
>
>
> The following script reads a large dataset 10 times in a loop. The dataset referred to comes from the Ursa benchmarks ([https://github.com/ursacomputing/benchmarks]); however, any sufficiently large dataset will do. This one is ~5-6 GB when deserialized into an Arrow table, and because the conversion to a dataframe is not zero-copy, each pass through the loop requires about 8.6 GB.
> Running this loop with mimalloc consumes 27 GB of RAM. It is pretty deterministic; even putting a 1-second sleep between iterations yields the same result. On the other hand, if I move the read into its own function (second version below), it uses only 14 GB.
> Our current rule of thumb seems to be "as long as the allocators stabilize to some number at some point, it is not a bug", so technically both 27 GB and 14 GB are valid.
> If we can't put any kind of bound whatsoever on the RAM that Arrow needs, it will eventually become a problem for adoption. I think we need to develop some characterization of how much mimalloc/jemalloc should be allowed to over-allocate before we consider it a bug and require changing the code to avoid the situation (or documenting that certain operations are not valid).
>
> ----CODE----
>
> // First version (uses ~27 GB)
> {code:python}
> import time
> import pyarrow as pa
> import pyarrow.parquet as pq
> import psutil
> import os
>
> pa.set_memory_pool(pa.mimalloc_memory_pool())
> print(pa.default_memory_pool().backend_name)
>
> # Reading at module scope keeps table/df alive across iterations.
> for _ in range(10):
>     table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
>     df = table.to_pandas()  # not zero-copy: duplicates the data
>     print(pa.total_allocated_bytes())
>     proc = psutil.Process(os.getpid())
>     print(proc.memory_info())
> {code}
> // Second version (uses ~14 GB)
> {code:python}
> import time
> import pyarrow as pa
> import pyarrow.parquet as pq
> import psutil
> import os
>
> pa.set_memory_pool(pa.mimalloc_memory_pool())
> print(pa.default_memory_pool().backend_name)
>
> def bm():
>     table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
>     df = table.to_pandas()
>     print(pa.total_allocated_bytes())
>     proc = psutil.Process(os.getpid())
>     print(proc.memory_info())
>
> # table and df go out of scope after each call, so Arrow frees them
> # between reads; the process now peaks at ~14 GB instead of ~27 GB.
> for _ in range(10):
>     bm()
> {code}
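To make the over-allocation characterization concrete, here is a minimal sketch of one possible metric (an illustration, not an agreed-upon Arrow benchmark): the ratio of process RSS to the bytes the default pool reports as live, tracked per iteration. The file path is a placeholder, and jemalloc_memory_pool() is not available in every pyarrow build.

{code:python}
# Sketch only: per-iteration over-allocation factor for the default pool.
import os
import time

import psutil
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder path; substitute any sufficiently large Parquet file.
PATH = 'fanniemae_2016Q4.uncompressed.parquet'

# Swap in pa.jemalloc_memory_pool() or pa.system_memory_pool() to compare
# allocators; run each pool in a fresh process so RSS is not confounded.
pa.set_memory_pool(pa.mimalloc_memory_pool())
print(pa.default_memory_pool().backend_name)

proc = psutil.Process(os.getpid())

for i in range(10):
    table = pq.read_table(PATH)
    df = table.to_pandas()
    allocated = pa.total_allocated_bytes()  # bytes the pool reports as live
    rss = proc.memory_info().rss            # bytes the OS charges the process
    # RSS also covers the pandas copy and the interpreter, so a factor of
    # roughly 2-3x is expected; a factor that keeps climbing across
    # iterations means freed pages are being retained by the allocator.
    print(f'iter {i}: allocated={allocated:,} rss={rss:,} '
          f'factor={rss / max(allocated, 1):.2f}')
    time.sleep(1)  # the issue notes a pause between reads changes nothing
{code}

A factor that stabilizes would pass the current "stabilizes to some number" rule of thumb; a factor that keeps climbing is the behavior this issue argues should be bounded.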