[ https://issues.apache.org/jira/browse/ARROW-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330948#comment-17330948 ]
Weston Pace commented on ARROW-12519:
-------------------------------------

[~apitrou] Yes, the RSS number. Here is the output of the script run with MIMALLOC_SHOW_STATS=1:

{code:java}
mimalloc
8609028224 pmem(rss=11891290112, vms=16666406912, shared=45887488, text=2023424, lib=0, data=16011034624, dirty=0)
8609028224 pmem(rss=17924034560, vms=25525309440, shared=47259648, text=2023424, lib=0, data=24869982208, dirty=0)
8609028224 pmem(rss=19656933376, vms=25793744896, shared=47259648, text=2023424, lib=0, data=25138417664, dirty=0)
8609028224 pmem(rss=21644206080, vms=26062180352, shared=47259648, text=2023424, lib=0, data=25406853120, dirty=0)
8609028224 pmem(rss=22500700160, vms=26330615808, shared=47259648, text=2023424, lib=0, data=25675292672, dirty=0)
8609028224 pmem(rss=23115137024, vms=26330615808, shared=46972928, text=2023424, lib=0, data=25675296768, dirty=0)
8609028224 pmem(rss=23457878016, vms=26330615808, shared=47063040, text=2023424, lib=0, data=25675296768, dirty=0)
8609028224 pmem(rss=23734255616, vms=26330615808, shared=45867008, text=2023424, lib=0, data=25675296768, dirty=0)
8609028224 pmem(rss=23847768064, vms=26330615808, shared=45510656, text=2023424, lib=0, data=25675300864, dirty=0)
8609028224 pmem(rss=23974707200, vms=26330615808, shared=45461504, text=2023424, lib=0, data=25675300864, dirty=0)

heap stats:     peak       total      freed      unit   count
  reserved:     51.1 gb    53.7 gb    53.7 gb    1 b           ok
  committed:    51.1 gb    53.7 gb    53.5 gb    1 b           not all freed!
  reset:        84.1 mb    1.8 gb     1.8 gb     1 b           not all freed!
  touched:      0 b        0 b        150.1 gb   1 b           ok
  segments:     166        36.6 k     36.6 k                   ok
  -abandoned:   0          0          0                        ok
  -cached:      0          0          0                        ok
  pages:        7.8 k      39.7 k     39.7 k                   ok
  -abandoned:   0          0          0                        ok
  -extended:    0
  -noretire:    0
  mmaps:        0
  commits:      100
  threads:      8          8          8                        ok
  searches:     0.0 avg
  numa nodes:   1
  elapsed:      57.660 s
  process: user: 169.387 s, system: 30.217 s, faults: 7, reclaims: 19183835, rss: 27.4 gb
{code}

> [C++] Create/document better characterization of jemalloc/mimalloc
> ------------------------------------------------------------------
>
>                 Key: ARROW-12519
>                 URL: https://issues.apache.org/jira/browse/ARROW-12519
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>         Attachments: csv-uncompressed-8core.png
>
> The following script reads a large dataset 10 times in a loop. The dataset is from the Ursa benchmarks ([https://github.com/ursacomputing/benchmarks]); however, any sufficiently large dataset should work. This one is ~5-6 GB when deserialized into an Arrow table. The conversion to a dataframe is not zero-copy, so each iteration of the loop requires about 8.6 GB.
> Running this code 10 times with mimalloc consumes 27 GB of RAM, quite deterministically; even putting a 1-second sleep between runs yields the same result. On the other hand, if I put the read into its own method (second version below), it uses only 14 GB.
> Our current rule of thumb seems to be "as long as the allocators stabilize to some number at some point then it is not a bug", so technically both 27 GB and 14 GB are valid.
> If we can't put any kind of bound whatsoever on the RAM that Arrow needs, it will eventually become a problem for adoption. I think we need to develop some characterization of how much mimalloc/jemalloc should be allowed to over-allocate before we consider it a bug and require changing the code to avoid the situation (or documenting that certain operations are not valid).
>
> ----CODE----
>
> // First version (uses ~27 GB)
> {code:python}
> import time
> import pyarrow as pa
> import pyarrow.parquet as pq
> import psutil
> import os
>
> pa.set_memory_pool(pa.mimalloc_memory_pool())
> print(pa.default_memory_pool().backend_name)
>
> for _ in range(10):
>     table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
>     df = table.to_pandas()
>     print(pa.total_allocated_bytes())
>     proc = psutil.Process(os.getpid())
>     print(proc.memory_info())
> {code}
>
> // Second version (uses ~14 GB)
> {code:python}
> import time
> import pyarrow as pa
> import pyarrow.parquet as pq
> import psutil
> import os
>
> pa.set_memory_pool(pa.mimalloc_memory_pool())
> print(pa.default_memory_pool().backend_name)
>
> def bm():
>     table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
>     df = table.to_pandas()
>     print(pa.total_allocated_bytes())
>     proc = psutil.Process(os.getpid())
>     print(proc.memory_info())
>
> for _ in range(10):
>     bm()
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
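A note on the 27 GB vs. 14 GB gap between the two versions: part of it is plain object liveness rather than allocator behavior. In the first version, the previous iteration's table and df are still bound at module scope while pq.read_table builds the next table, so roughly old table + old df + new table (~6 + 2.6 + 6 ≈ 14.6 GB) are live at once; in the second version, bm()'s locals die on every return, so the live peak is just table + df (~8.6 GB). The remainder of each observed RSS figure would be allocator retention. A minimal sketch of this accounting, using a counter in place of real buffers (the unit sizes are the illustrative figures from the description, in tenths of a GB, and the model assumes CPython's immediate refcount-based destruction):

```python
class Tracker:
    """Counts live allocation units and records the high-water mark."""
    def __init__(self):
        self.live = 0
        self.peak = 0
    def alloc(self, units):
        self.live += units
        self.peak = max(self.peak, self.live)
    def free(self, units):
        self.live -= units

class Buf:
    """Stand-in for a Table/DataFrame: charges the tracker on creation,
    credits it back when garbage-collected."""
    def __init__(self, tracker, units):
        self.tracker, self.units = tracker, units
        tracker.alloc(units)
    def __del__(self):
        self.tracker.free(self.units)

TABLE, DF = 60, 26  # tenths of a GB: ~6 GB table, ~2.6 GB dataframe

# Version 1: loop at module scope. The new Buf is constructed before the
# rebind drops the old one, so old table + old df + new table coexist.
t1 = Tracker()
table = df = None
for _ in range(10):
    table = Buf(t1, TABLE)  # old table freed only after the new one exists
    df = Buf(t1, DF)

# Version 2: the same work inside a function; locals die on every return,
# so nothing from iteration N survives into iteration N+1.
t2 = Tracker()
def bm():
    table = Buf(t2, TABLE)
    df = Buf(t2, DF)
for _ in range(10):
    bm()

print(t1.peak / 10, t2.peak / 10)  # prints "14.6 8.6"
```

The live peaks (14.6 vs. 8.6 units) track the direction and much of the size of the measured difference; what the allocator retains on top of the live peak is the quantity this issue proposes to characterize.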
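For reference on the pmem(...) lines in the comment above: they are psutil's Process.memory_info(), which on Linux is populated from /proc/<pid>/statm, where every field is a count of memory pages. A stdlib-only sketch of the same computation (Linux-specific; psutil remains the portable way to get these numbers):

```python
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")  # bytes per page, typically 4096

def parse_statm(statm_line):
    """Convert one /proc/<pid>/statm line to bytes, using psutil's field names.

    statm fields (all in pages): size resident shared text lib data dt,
    which psutil reports as vms, rss, shared, text, lib, data, dirty.
    """
    names = ("vms", "rss", "shared", "text", "lib", "data", "dirty")
    pages = (int(field) for field in statm_line.split())
    return {name: n * PAGE_SIZE for name, n in zip(names, pages)}

def current_rss():
    """RSS of this process in bytes, like memory_info().rss (Linux only)."""
    with open("/proc/self/statm") as f:
        return parse_statm(f.read())["rss"]
```

For example, with a 4096-byte page, a resident count of 2903147 pages yields rss = 11891290112 bytes, the value in the first pmem line.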