Laurent Mazare created ARROW-12983:
--------------------------------------
Summary: Very large memory consumption when building a table
Key: ARROW-12983
URL: https://issues.apache.org/jira/browse/ARROW-12983
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 4.0.1, 4.0.0
Reporter: Laurent Mazare
_Apologies if this is a duplicate; I haven't found anything related._
When creating an Arrow table via the Python API, the following code runs out of
memory after exhausting all available resources on a box with 512GB of RAM. This
happens with pyarrow 4.0.0 and 4.0.1. However, when running the same code with
pyarrow 3.0.0, memory usage only reaches 5GB (which seems like the
appropriate ballpark for the table size).
The code generates a table with a single string column of 1 million rows
(1000 distinct random strings, each repeated 1000 times), each string being
3000 characters long.
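As a rough sanity check on the "appropriate ballpark", here is a back-of-the-envelope estimate of the column's size. It assumes Arrow's default string layout (UTF-8 bytes plus one 32-bit offset per row) and ignores the validity bitmap and any chunking overhead, so treat it as an approximation rather than an exact figure.

```python
# Back-of-the-envelope size of the string column described above.
# Assumption: default Arrow `string` layout, i.e. UTF-8 data plus one
# 32-bit offset per row (and one trailing offset); validity bitmap ignored.
N_ROWS = 1_000 * 1_000   # 1000 distinct strings, each repeated 1000 times
STRING_LEN = 3_000       # ASCII uppercase letters, so 1 byte each in UTF-8
OFFSET_BYTES = 4

data_bytes = N_ROWS * STRING_LEN
offset_bytes = (N_ROWS + 1) * OFFSET_BYTES
total_bytes = data_bytes + offset_bytes

print(f"string data: {data_bytes / 1e9:.2f} GB")
print(f"offsets:     {offset_bytes / 1e6:.2f} MB")
print(f"total:       {total_bytes / 1e9:.2f} GB")
```

Under these assumptions the column holds about 3GB of data, so a peak of around 5GB while building it looks plausible, whereas exhausting 512GB does not.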
I'm not sure whether the issue is Python-specific, as I haven't tried
reproducing it from the C++ API.
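{code:python}
import string

import numpy as np
import pyarrow as pa

print(pa.__version__)
np.random.seed(42)
alphabet = list(string.ascii_uppercase)

# Build 1000 distinct random strings of 3000 characters, each repeated 1000 times.
_col = []
for _n in range(1000):
    k = ''.join(np.random.choice(alphabet, 3000))
    _col += [k] * 1000

# With pyarrow 4.0.0/4.0.1 this call exhausts memory; with 3.0.0 it peaks around 5GB.
table = pa.Table.from_pydict({'col': _col})
{code}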
--
This message was sent by Atlassian Jira
(v8.3.4#803005)