[ https://issues.apache.org/jira/browse/ARROW-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186602#comment-17186602 ]

David Li commented on ARROW-9878:
---------------------------------

It's actually fairly complex. There are two issues here:
 # The in-memory layout of arrays/record batches deserialized from IPC "blocks" the optimization.
 # jemalloc tends to cache memory, even when configured to return free memory to the OS, so unless you are close to an OOM scenario, you will still likely observe doubled memory usage.

For (1):

When you deserialize batches from IPC, memory ends up being laid out like this:

Batch 0: Allocation 0: array 0 chunk 0, array 1 chunk 0, ...
Batch 1: Allocation 1: array 0 chunk 1, array 1 chunk 1, ...
Batch 2: Allocation 2: array 0 chunk 2, array 1 chunk 2, ...
...

And so (chunked) array 0 is a set of references to allocation 0, allocation 1, 
...

to_pandas operates _columnwise_, so it'll drop references to each array as it 
goes - but since each array is a slice of a larger allocation, this doesn't 
free any memory!
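
You can see this with a toy sketch (not IPC-specific; the sizes are illustrative) by watching the default memory pool via pa.total_allocated_bytes():

{code:python}
import pyarrow as pa

before = pa.total_allocated_bytes()
arr = pa.array(range(1_000_000))  # one large allocation (~8 MB of int64)
small = arr.slice(0, 10)          # zero-copy view into the same buffer
del arr                           # drop the "big" reference
# The slice still holds a reference to the underlying buffer, so
# (almost) nothing is returned to the memory pool:
print(pa.total_allocated_bytes() - before)
{code}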

So first, you need something like the following to give every column its own allocation:

{code:python}
# Copy each column into freshly allocated buffers. concat_arrays
# allocates new memory even for a single input array, so dropping a
# column afterwards actually releases its allocation.
new_batches = [
    pa.RecordBatch.from_arrays(
        [pa.concat_arrays([arr]) for arr in batch.columns],
        schema=batch.schema,
    )
    for batch in batches
]
{code}
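
Rebuilding the table from these per-column copies then lets self_destruct actually release memory column-by-column as to_pandas proceeds (a sketch reusing the names from the snippets above and from the reproduction below):

{code:python}
pa_table = pa.Table.from_batches(new_batches)
pdf = pa_table.to_pandas(self_destruct=True, split_blocks=True, use_threads=False)
{code}
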
Even then, if you measure memory usage with something like [memory-profiler|https://pypi.org/project/memory-profiler/], you'll still see a spike unless you are close to OOM, because jemalloc caches allocations. You can avoid this by patching Arrow to force jemalloc to dump all of its caches after free()ing memory, though this is expensive. You can also avoid it by benchmarking in a scenario where the dataset is over half of available RAM.
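
As a partial mitigation (not the cache-dump patch mentioned above, and only if your pyarrow build uses jemalloc), you can tell jemalloc to return freed pages to the OS immediately:

{code:python}
import pyarrow as pa

# Set jemalloc's dirty-page decay to 0 ms so freed memory is handed
# back to the OS right away instead of being cached for reuse. This
# reduces, but does not eliminate, the caching effect described above.
pa.jemalloc_set_decay_ms(0)
{code}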

[~wesm] maybe it's worth documenting these caveats somewhere?

 

> [Python] table to_pandas self_destruct=True + split_blocks=True cannot 
> prevent doubling memory
> ----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9878
>                 URL: https://issues.apache.org/jira/browse/ARROW-9878
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.17.1, 1.0.0
>            Reporter: Weichen Xu
>            Priority: Major
>         Attachments: t001.png
>
>
> Test on: pyarrow 1.0.1, system: Ubuntu 16.04, python3.7
>  
> Reproduce code:
> Generate about 800 MB of data first.
> {code:python}
> import pyarrow as pa
> # generate about 800MB data
> data = [pa.array([10] * 1000)]
> batch = pa.record_batch(data, names=['f0'])
> with open('/tmp/t1.pa', 'wb') as f1:
>       writer = pa.ipc.new_stream(f1, batch.schema)
>       for i in range(100000):
>               writer.write_batch(batch)
>       writer.close()
> {code}
> Test to_pandas with self_destruct=True, split_blocks=True, use_threads=False
> {code:python}
> import pyarrow as pa
> import time
> import sys
> import os
> pid = os.getpid()
> print(f'run `psrecord {pid} --plot /tmp/t001.png` and then press ENTER.')
> sys.stdin.readline()
> with open('/tmp/t1.pa', 'rb') as f1:
>       reader = pa.ipc.open_stream(f1)
>       batches = [b for b in reader]
> pa_table = pa.Table.from_batches(batches)
> del batches
> time.sleep(3)
> pdf = pa_table.to_pandas(self_destruct=True, split_blocks=True, use_threads=False)
> del pa_table
> time.sleep(3)
> {code}
> The attached file is the psrecord profiling result.


