Mikhail created ARROW-13406:
-------------------------------
Summary: [Python] pyarrow.array memory leak on large string arrays
Key: ARROW-13406
URL: https://issues.apache.org/jira/browse/ARROW-13406
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 4.0.1
Environment: Linux 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64 GNU/Linux
Darwin 19.6.0 Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64 x86_64
Reporter: Mikhail
For large inputs (starting at roughly 500 MB), the `pyarrow.array` constructor
hangs and keeps consuming memory until the process is killed, either by hand or
by the OOM killer.
{code:python}
import pyarrow as pa
my_string = 'a' * 40
strings = [my_string for _ in range(100_000_000)]
pyarrow_array = pa.array(strings[:50_000_000])  # completes in a couple of seconds
pyarrow_array = pa.array(strings[:60_000_000])  # hangs and consumes all free memory
{code}
With pyarrow==3.0.0 the same code completes without issue.
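One observation that may help triage: Arrow's default `string` type stores 32-bit offsets, so a single array can hold at most 2**31 - 1 bytes of character data. The sketch below only does that arithmetic for the two slice sizes in the reproduction. That this limit is related to the hang is an assumption, not something confirmed in this report. The helper names are hypothetical, and `to_large_string_array` is an untested workaround idea that relies on `pa.large_string()` (64-bit offsets).

```python
INT32_MAX = 2**31 - 1  # largest byte offset a 32-bit-offset string array can address


def utf8_payload_bytes(n_strings: int, string_len: int) -> int:
    """Total character data, in bytes, for n_strings ASCII strings of string_len each."""
    return n_strings * string_len


def fits_in_one_chunk(n_strings: int, string_len: int) -> bool:
    """True if the character data fits under the 32-bit offset limit."""
    return utf8_payload_bytes(n_strings, string_len) <= INT32_MAX


def to_large_string_array(values):
    """Hypothetical workaround (assumption, untested against this bug):
    request 64-bit offsets so the data does not have to be split into chunks."""
    import pyarrow as pa  # imported lazily so the arithmetic above runs without pyarrow

    return pa.array(values, type=pa.large_string())


if __name__ == "__main__":
    # Same string length (40) and slice sizes as in the reproduction above.
    for n in (50_000_000, 60_000_000):
        print(n, utf8_payload_bytes(n, 40), fits_in_one_chunk(n, 40))
```

The 50,000,000-element slice carries 2.0e9 bytes of character data, which is below the limit, while the 60,000,000-element slice carries 2.4e9 bytes, which is above it. If the assumption holds, the working and hanging cases sit on opposite sides of that boundary.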
--
This message was sent by Atlassian Jira
(v8.3.4#803005)