Mikhail created ARROW-13406:
-------------------------------

             Summary: [Python] pyarrow.array memory leak on large string arrays
                 Key: ARROW-13406
                 URL: https://issues.apache.org/jira/browse/ARROW-13406
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 4.0.1
         Environment: Linux 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64 GNU/Linux

Darwin 19.6.0 Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64 x86_64
            Reporter: Mikhail


For large inputs (starting at roughly 500 MB), the `pyarrow.array` constructor hangs and 
keeps consuming memory until the process is killed (by hand or by the OOM killer).

{code:python}
import pyarrow as pa
my_string = 'a' * 40
strings = [my_string for _ in range(100_000_000)]
pyarrow_array = pa.array(strings[:50_000_000]) # this completes in a couple of seconds
pyarrow_array = pa.array(strings[:60_000_000]) # this hangs and consumes all free memory
{code}
 
With pyarrow==3.0.0 the same code runs without problems.
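
For scale, here is a rough sketch of the raw string payload in the two slices from the reproduction. The working slice stays under 2**31 - 1 bytes, the most that a single Arrow `string` array with 32-bit offsets can address, and the hanging slice goes over it. Whether that limit has anything to do with the hang is an assumption on my part and is not confirmed anywhere in this report. The helper names (`payload_bytes`, `fits_in_int32_offsets`, `build_array`) are hypothetical, and passing `pa.large_string()` (64-bit offsets) is only something one could try, not a verified workaround.

```python
INT32_MAX = 2**31 - 1  # largest byte offset a 32-bit-offset string array can hold


def payload_bytes(n_strings, string_len=40):
    """Raw byte size of n_strings ASCII strings of string_len characters each."""
    return n_strings * string_len


def fits_in_int32_offsets(n_strings, string_len=40):
    """True if the raw string data fits in one array with 32-bit offsets."""
    return payload_bytes(n_strings, string_len) <= INT32_MAX


def build_array(values):
    """Hypothetical helper: pick the string type from the total payload size.

    Assumption: requesting large_string avoids the problem; this has not
    been verified against pyarrow 4.0.1.
    """
    import pyarrow as pa  # imported lazily so the arithmetic above runs without it

    total = sum(len(v.encode("utf-8")) for v in values)
    if total <= INT32_MAX:
        return pa.array(values, type=pa.string())
    return pa.array(values, type=pa.large_string())


if __name__ == "__main__":
    for n in (50_000_000, 60_000_000):
        print(n, payload_bytes(n), fits_in_int32_offsets(n))
```

With the 40-character strings used above, 50,000,000 strings come to 2,000,000,000 bytes (under the limit) and 60,000,000 strings come to 2,400,000,000 bytes (over it).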

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
