[
https://issues.apache.org/jira/browse/ARROW-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17187331#comment-17187331
]
Wes McKinney commented on ARROW-9878:
-------------------------------------
Yes, I agree they should be documented. With the way that this is set up, all
of the columns are slices of the message bodies from the stream, so no memory
is deallocated until all of the column references for a chunk of the table are
released. Unfortunately, `split_blocks=True` yields memory improvements only
in fairly narrow cases: when the memory in the columns is owned and not sliced
from some other memory block.
> [Python] table to_pandas self_destruct=True + split_blocks=True cannot
> prevent doubling memory
> ----------------------------------------------------------------------------------------------
>
> Key: ARROW-9878
> URL: https://issues.apache.org/jira/browse/ARROW-9878
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 0.17.1, 1.0.0
> Reporter: Weichen Xu
> Priority: Major
> Attachments: t001.png
>
>
> Test on: pyarrow 1.0.1, system: Ubuntu 16.04, python3.7
>
> Reproduce code:
> Generate about 800MB data first.
> {code:python}
> import pyarrow as pa
> # generate about 800MB data
> data = [pa.array([10] * 1000)]
> batch = pa.record_batch(data, names=['f0'])
> with open('/tmp/t1.pa', 'wb') as f1:
>     writer = pa.ipc.new_stream(f1, batch.schema)
>     for i in range(100000):
>         writer.write_batch(batch)
>     writer.close()
> {code}
> Test to_pandas with self_destruct=True, split_blocks=True, use_threads=False
> {code:python}
> import pyarrow as pa
> import time
> import sys
> import os
> pid = os.getpid()
> print(f'run `psrecord {pid} --plot /tmp/t001.png` and then press ENTER.')
> sys.stdin.readline()
> with open('/tmp/t1.pa', 'rb') as f1:
>     reader = pa.ipc.open_stream(f1)
>     batches = [b for b in reader]
> pa_table = pa.Table.from_batches(batches)
> del batches
> time.sleep(3)
> pdf = pa_table.to_pandas(self_destruct=True, split_blocks=True,
>                          use_threads=False)
> del pa_table
> time.sleep(3)
> {code}
> The attached file (t001.png) is the psrecord profiling result.