[
https://issues.apache.org/jira/browse/ARROW-12983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Krisztian Szucs resolved ARROW-12983.
-------------------------------------
Fix Version/s: 5.0.0
Resolution: Fixed
Issue resolved by pull request 10556
[https://github.com/apache/arrow/pull/10556]
> [C++][Python] Converter::Extend gets stuck in infinite loop causing OOM if
> values don't fit in single chunk
> -----------------------------------------------------------------------------------------------------------
>
> Key: ARROW-12983
> URL: https://issues.apache.org/jira/browse/ARROW-12983
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 4.0.0, 4.0.1
> Reporter: Laurent Mazare
> Assignee: David Li
> Priority: Major
> Labels: pull-request-available
> Fix For: 5.0.0
>
> Time Spent: 5h 10m
> Remaining Estimate: 0h
>
> _Apologies if this is a duplicate; I haven't found anything related._
> When creating an Arrow table via the Python API, the following code runs out
> of memory after exhausting all the available resources on a box with 512 GB
> of RAM. This happens with pyarrow 4.0.0 and 4.0.1. However, when running the
> same code with pyarrow 3.0.0, memory usage only reaches 5 GB (which seems
> like the appropriate ballpark for the table size).
> The code generates a table with a single string column of 1M rows, each
> string being 3000 characters long.
> I'm not sure whether the issue is Python-specific; I haven't tried
> replicating it with the C++ API.
>
> {code:python}
> import string
> import numpy as np
> import pyarrow as pa
> print(pa.__version__)
> np.random.seed(42)
> alphabet = list(string.ascii_uppercase)
> # Build 1,000,000 rows: 1000 random 3000-character strings, each repeated 1000 times.
> _col = []
> for _n in range(1000):
>     k = ''.join(np.random.choice(alphabet, 3000))
>     _col += [k] * 1000
> table = pa.Table.from_pydict({'col': _col})
> {code}
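> For scale, the column holds roughly 1,000,000 x 3,000 = 3 GB of character
> data, which is more than the 2 GiB that a 32-bit-offset string array can
> hold in a single chunk, hence the "don't fit in single chunk" condition in
> the title. Below is a minimal sketch (not part of the original report) of
> how one might confirm that the conversion chunks the column instead of
> looping, assuming pyarrow >= 5.0.0 with the fix; the expected chunk count
> is an assumption.
> {code:python}
> # Sketch, assuming the table above was built with a fixed pyarrow (>= 5.0.0):
> # the converted column should come back as a ChunkedArray with more than one
> # chunk rather than spinning in Converter::Extend.
> col = table.column('col')
> print(col.num_chunks)   # expected > 1: ~3 GB exceeds the 2 GiB per-chunk offset limit
> print(table.num_rows)   # 1000000
> {code}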
--
This message was sent by Atlassian Jira
(v8.3.4#803005)