[ https://issues.apache.org/jira/browse/ARROW-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Li updated ARROW-9878:
----------------------------
    Component/s: Documentation

> [Python] table to_pandas self_destruct=True + split_blocks=True cannot prevent doubling memory
> ----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9878
>                 URL: https://issues.apache.org/jira/browse/ARROW-9878
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Documentation, Python
>    Affects Versions: 0.17.1, 1.0.0
>            Reporter: Weichen Xu
>            Assignee: David Li
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>         Attachments: t001.png
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Tested on: pyarrow 1.0.1, system: Ubuntu 16.04, python3.7
>
> Code to reproduce:
> First, generate about 800MB of data.
> {code:python}
> import pyarrow as pa
> # generate about 800MB data
> data = [pa.array([10] * 1000)]
> batch = pa.record_batch(data, names=['f0'])
> with open('/tmp/t1.pa', 'wb') as f1:
>     writer = pa.ipc.new_stream(f1, batch.schema)
>     for i in range(100000):
>         writer.write_batch(batch)
>     writer.close()
> {code}
> Then test to_pandas with self_destruct=True, split_blocks=True, use_threads=False:
> {code:python}
> import pyarrow as pa
> import time
> import sys
> import os
> pid = os.getpid()
> print(f'run `psrecord {pid} --plot /tmp/t001.png` and then press ENTER.')
> sys.stdin.readline()
> with open('/tmp/t1.pa', 'rb') as f1:
>     reader = pa.ipc.open_stream(f1)
>     batches = [b for b in reader]
>     pa_table = pa.Table.from_batches(batches)
>     del batches
> time.sleep(3)
> pdf = pa_table.to_pandas(self_destruct=True, split_blocks=True,
>                          use_threads=False)
> del pa_table
> time.sleep(3)
> {code}
> The attached file is the psrecord profiling result.
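
The "about 800MB" figure in the report can be checked with a rough calculation. This is only a sketch: it assumes pyarrow infers int64 (8 bytes per value) for the Python integers passed to pa.array, and it ignores IPC framing and metadata overhead. The constant names are illustrative and do not appear in the original report.

```python
# Back-of-the-envelope size of the data written by the generation script.
VALUES_PER_BATCH = 1000   # len([10] * 1000) in the report
NUM_BATCHES = 100_000     # range(100000) in the report
BYTES_PER_VALUE = 8       # assumption: values are stored as int64

total_bytes = VALUES_PER_BATCH * NUM_BATCHES * BYTES_PER_VALUE
print(f"{total_bytes / 1e6:.0f} MB ({total_bytes / 2**20:.0f} MiB)")
```

Under these assumptions, a conversion that avoided a second copy would presumably keep peak resident memory in the psrecord plot near this figure, while a doubling would show up as a peak approaching twice it.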