[https://issues.apache.org/jira/browse/ARROW-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330948#comment-17330948]

Weston Pace commented on ARROW-12519:
-------------------------------------

[~apitrou] Yes, the RSS number. Here is the output of the script run with MIMALLOC_SHOW_STATS=1:
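For context, MIMALLOC_SHOW_STATS=1 is an environment variable read by mimalloc that makes it dump its heap statistics when the process exits. A minimal sketch of how the run is launched (the script name repro.py is a placeholder, not from the original report):

{code:python}
# Placeholder sketch: launch the benchmark script (repro.py, hypothetical)
# with MIMALLOC_SHOW_STATS=1 so mimalloc prints the "heap stats:" table
# below when the child process exits.
import os
import subprocess

env = dict(os.environ, MIMALLOC_SHOW_STATS="1")
# subprocess.run(["python", "repro.py"], env=env, check=True)
print(env["MIMALLOC_SHOW_STATS"])  # prints 1
{code}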
{code:java}
mimalloc
8609028224
pmem(rss=11891290112, vms=16666406912, shared=45887488, text=2023424, lib=0, data=16011034624, dirty=0)
8609028224
pmem(rss=17924034560, vms=25525309440, shared=47259648, text=2023424, lib=0, data=24869982208, dirty=0)
8609028224
pmem(rss=19656933376, vms=25793744896, shared=47259648, text=2023424, lib=0, data=25138417664, dirty=0)
8609028224
pmem(rss=21644206080, vms=26062180352, shared=47259648, text=2023424, lib=0, data=25406853120, dirty=0)
8609028224
pmem(rss=22500700160, vms=26330615808, shared=47259648, text=2023424, lib=0, data=25675292672, dirty=0)
8609028224
pmem(rss=23115137024, vms=26330615808, shared=46972928, text=2023424, lib=0, data=25675296768, dirty=0)
8609028224
pmem(rss=23457878016, vms=26330615808, shared=47063040, text=2023424, lib=0, data=25675296768, dirty=0)
8609028224
pmem(rss=23734255616, vms=26330615808, shared=45867008, text=2023424, lib=0, data=25675296768, dirty=0)
8609028224
pmem(rss=23847768064, vms=26330615808, shared=45510656, text=2023424, lib=0, data=25675300864, dirty=0)
8609028224
pmem(rss=23974707200, vms=26330615808, shared=45461504, text=2023424, lib=0, data=25675300864, dirty=0)
heap stats:     peak      total      freed       unit      count
  reserved:    51.1 gb    53.7 gb    53.7 gb       1 b              ok
 committed:    51.1 gb    53.7 gb    53.5 gb       1 b              not all freed!
     reset:    84.1 mb     1.8 gb     1.8 gb       1 b              not all freed!
   touched:       0 b        0 b    150.1 gb       1 b              ok
  segments:     166       36.6 k     36.6 k                         ok
-abandoned:       0          0          0                           ok
   -cached:       0          0          0                           ok
     pages:     7.8 k     39.7 k     39.7 k                         ok
-abandoned:       0          0          0                           ok
 -extended:       0
 -noretire:       0
     mmaps:       0
   commits:     100
   threads:       8          8          8                           ok
  searches:     0.0 avg
numa nodes:       1
   elapsed:      57.660 s
   process: user: 169.387 s, system: 30.217 s, faults: 7, reclaims: 19183835, rss: 27.4 gb

{code}
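One rough way to quantify the retention shown above (a sketch I'm adding for illustration, not part of the original script) is to divide RSS by pa.total_allocated_bytes(): the loop holds ~8.6 GB live while RSS climbs to ~24 GB, an over-allocation ratio near 2.8.

{code:python}
# Illustrative sketch (not from the original script): a simple
# over-allocation metric -- bytes resident in the OS per byte the
# application believes is live. Bounding this ratio is one way to turn
# "the allocator stabilizes eventually" into a testable criterion.
def overallocation_ratio(rss_bytes, allocated_bytes):
    """Return RSS divided by the application's live-allocation count."""
    if allocated_bytes == 0:
        return float("inf")
    return rss_bytes / allocated_bytes

# Numbers from the final iteration above: 23974707200 bytes RSS vs
# 8609028224 bytes live according to pa.total_allocated_bytes().
print(round(overallocation_ratio(23974707200, 8609028224), 2))  # prints 2.78
{code}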

> [C++] Create/document better characterization of jemalloc/mimalloc
> ------------------------------------------------------------------
>
>                 Key: ARROW-12519
>                 URL: https://issues.apache.org/jira/browse/ARROW-12519
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>         Attachments: csv-uncompressed-8core.png
>
>
> The following script reads a large dataset 10 times in a loop.  The 
> dataset is from the Ursa benchmarks here 
> ([https://github.com/ursacomputing/benchmarks]).  However, any sufficiently 
> large dataset will do.  This one is ~5-6 GB when deserialized into 
> an Arrow table.  The conversion to a dataframe is not zero-copy, so each 
> loop iteration requires about 8.6 GB.
> Running this code 10 times with mimalloc consumes 27 GB of RAM.  It is pretty 
> deterministic; even putting a 1-second sleep in between runs yields the 
> same result.  On the other hand, if I put the read into its own method (second 
> version below), it uses only 14 GB.
> Our current rule of thumb seems to be "as long as the allocators stabilize to 
> some number at some point then it is not a bug" so technically both 27GB and 
> 14GB are valid.
> If we can't put any kind of bound whatsoever on the RAM that Arrow needs, 
> it will eventually become a problem for adoption.  I think we need to develop 
> some characterization of how much mimalloc/jemalloc should be 
> allowed to over-allocate before we consider it a bug and require changing the 
> code to avoid the situation (or documenting that certain operations are not 
> valid).
>  
> ----CODE----
>  
> // First version (uses ~27GB)
> {code:python}
> import time
> import pyarrow as pa
> import pyarrow.parquet as pq
> import psutil
> import os
> pa.set_memory_pool(pa.mimalloc_memory_pool())
> print(pa.default_memory_pool().backend_name)
> for _ in range(10):
>     table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
>     df = table.to_pandas()
>     print(pa.total_allocated_bytes())
>     proc = psutil.Process(os.getpid())
>     print(proc.memory_info())
> {code}
> // Second version (uses ~14GB)
> {code:python}
> import time
> import pyarrow as pa
> import pyarrow.parquet as pq
> import psutil
> import os
> pa.set_memory_pool(pa.mimalloc_memory_pool())
> print(pa.default_memory_pool().backend_name)
> def bm():
>     table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
>     df = table.to_pandas()
>     print(pa.total_allocated_bytes())
>     proc = psutil.Process(os.getpid())
>     print(proc.memory_info())
> for _ in range(10):
>     bm()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
