[jira] [Updated] (ARROW-9878) arrow table to_pandas self_destruct=True + split_blocks=True cannot prevent doubling memory.

Weichen Xu (Jira) Thu, 27 Aug 2020 20:46:38 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Weichen Xu updated ARROW-9878:
------------------------------
    Description: 
Tested on: pyarrow 1.0.1, Ubuntu 16.04, Python 3.7.

 

Reproduce code:

First, generate about 800 MB of data:
{code:java}
import pyarrow as pa

# Generate about 800 MB of data: 100,000 batches x 1,000 int64 values x 8 bytes.
data = [pa.array([10] * 1000)]
batch = pa.record_batch(data, names=['f0'])
with open('/tmp/t1.pa', 'wb') as f1:
    writer = pa.ipc.new_stream(f1, batch.schema)
    for i in range(100000):
        writer.write_batch(batch)
    # Close the writer so the end-of-stream marker is written.
    writer.close()
{code}
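As a quick sanity check on the "about 800MB" figure, the payload size can be worked out by hand. This is only a sketch: it assumes the Python integers are inferred as int64 (8 bytes each) and it ignores IPC framing and per-batch metadata, so the file on disk will be somewhat larger.

```python
# Back-of-the-envelope payload size of the stream written by the script above.
ROWS_PER_BATCH = 1000    # len([10] * 1000)
NUM_BATCHES = 100_000    # range(100000)
BYTES_PER_VALUE = 8      # assumption: values are stored as int64

payload_bytes = ROWS_PER_BATCH * NUM_BATCHES * BYTES_PER_VALUE
print(f"{payload_bytes / 1e6:.0f} MB of int64 values")  # prints "800 MB of int64 values"
```

If conversion to pandas really doubles memory, the process would therefore be expected to peak at roughly twice this figure.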
Then test to_pandas with self_destruct=True, split_blocks=True, use_threads=False:
{code:python}
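import os
import sys
import time

import pyarrow as pa

# Pause so that psrecord can be attached to this process before memory grows.
pid = os.getpid()
print(f'run `psrecord {pid} --plot /tmp/t001.png` and then press ENTER.')
sys.stdin.readline()

# Load every record batch from the stream into memory.
with open('/tmp/t1.pa', 'rb') as f1:
    reader = pa.ipc.open_stream(f1)
    batches = [b for b in reader]

pa_table = pa.Table.from_batches(batches)
del batches
time.sleep(3)  # leave a flat segment in the plot before the conversion
pdf = pa_table.to_pandas(self_destruct=True, split_blocks=True,
                         use_threads=False)
del pa_table
time.sleep(3)  # and another flat segment after it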
{code}
The attached file (t001.png) is the psrecord profiling result.
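For anyone who wants numbers without attaching psrecord, a rough in-process check is sketched below. It is not part of the original report: `rss_bytes`, `peak_rss_bytes` and `measure` are hypothetical helper names, the code assumes Linux (it reads /proc/self/statm and treats ru_maxrss as kilobytes), and the bytearray allocation is only a stand-in workload.

```python
import os
import resource


def rss_bytes():
    """Current resident set size of this process in bytes (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])  # second field: resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def peak_rss_bytes():
    """Peak resident set size so far; assumes Linux, where ru_maxrss is in KB."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024


def measure(label, func):
    """Hypothetical helper: run func() and print RSS before, after, and peak."""
    before = rss_bytes()
    result = func()
    after = rss_bytes()
    print(f"{label}: before={before / 1e6:.0f} MB, after={after / 1e6:.0f} MB, "
          f"peak so far={peak_rss_bytes() / 1e6:.0f} MB")
    return result


# Stand-in workload; replace with the step to be measured.
buf = measure("allocate 50 MB", lambda: bytearray(50 * 1024 * 1024))
```

In the reproduction script, the conversion could be wrapped as `pdf = measure("to_pandas", lambda: pa_table.to_pandas(self_destruct=True, split_blocks=True, use_threads=False))`. The before/after samples alone would miss a transient spike during the conversion, which is why the peak value is printed as well; psrecord's sampled plot remains the better view of the whole timeline.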

  was:
Test on: pyarrow 1.0.1, system: Ubuntu 16.04, python3.7

 

Reproduce code:

Generate about 800MB data first.
{code:java}
Unable to find source-code formatter for language: python. Available languages 
are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, 
groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, 
php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml
import pyarrow as pa

# generate about 800MB data
data = [pa.array([10]* 1000)]
batch = pa.record_batch(data, names=['f0'])
with open('/tmp/t1.pa', 'wb') as f1:
        writer = pa.ipc.new_stream(f1, batch.schema)
        for i in range(100000):
                writer.write_batch(batch)
        writer.close()
{code}
Test to_pandas with self_destruct=True, split_blocks=True, use_threads=False
{code:python}
import pyarrow as pa
import time
import sys

import os
pid = os.getpid()
print(f'run `psrecord {pid} --plot /tmp/t001.png` and then press ENTER.')
sys.stdin.readline()

with open('/tmp/t1.pa', 'rb') as f1:
        reader = pa.ipc.open_stream(f1)
        batches = [b for b in reader]

pa_table = pa.Table.from_batches(batches)
del batches
time.sleep(3)
pdf = pa_table.to_pandas(self_destruct=True, split_blocks=True, 
use_threads=False)
del pa_table
time.sleep(3)
{code}
The attached file is psrecord profiling result.


> arrow table to_pandas self_destruct=True + split_blocks=True cannot prevent 
> doubling memory.
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9878
>                 URL: https://issues.apache.org/jira/browse/ARROW-9878
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.17.1, 1.0.0
>            Reporter: Weichen Xu
>            Priority: Major
>         Attachments: t001.png



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
