[jira] [Commented] (ARROW-9878) [Python] table to_pandas self_destruct=True + split_blocks=True cannot prevent doubling memory

Wes McKinney (Jira) Sun, 30 Aug 2020 14:19:38 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17187331#comment-17187331
 ]


Wes McKinney commented on ARROW-9878:
-------------------------------------

Yes, I agree they should be documented. With the way that this is set up, all 
of the columns are slices of the message bodies from the stream, so no memory 
is deallocated until all of the column references for a chunk of the table are 
released. Unfortunately, `split_blocks=True` yields memory improvements only 
in fairly narrow cases: when the memory in the columns is owned and not sliced 
from some other memory block.

> [Python] table to_pandas self_destruct=True + split_blocks=True cannot 
> prevent doubling memory
> ----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9878
>                 URL: https://issues.apache.org/jira/browse/ARROW-9878
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.17.1, 1.0.0
>            Reporter: Weichen Xu
>            Priority: Major
>         Attachments: t001.png
>
>
> Tested on: pyarrow 1.0.1, Ubuntu 16.04, Python 3.7
>  
> Code to reproduce:
> First, generate about 800MB of data.
> {code:python}
> import pyarrow as pa
> # generate about 800MB data
> data = [pa.array([10] * 1000)]
> batch = pa.record_batch(data, names=['f0'])
> with open('/tmp/t1.pa', 'wb') as f1:
>     writer = pa.ipc.new_stream(f1, batch.schema)
>     for i in range(100000):
>         writer.write_batch(batch)
>     writer.close()
> {code}
> Test to_pandas with self_destruct=True, split_blocks=True, use_threads=False
> {code:python}
> import pyarrow as pa
> import time
> import sys
> import os
> pid = os.getpid()
> print(f'run `psrecord {pid} --plot /tmp/t001.png` and then press ENTER.')
> sys.stdin.readline()
> with open('/tmp/t1.pa', 'rb') as f1:
>     reader = pa.ipc.open_stream(f1)
>     batches = [b for b in reader]
> pa_table = pa.Table.from_batches(batches)
> del batches
> time.sleep(3)
> pdf = pa_table.to_pandas(self_destruct=True, split_blocks=True,
>                          use_threads=False)
> del pa_table
> time.sleep(3)
> {code}
> The attached file is the psrecord profiling result.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
