[ 
https://issues.apache.org/jira/browse/ARROW-12983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-12983:
-----------------------------
    Summary: [C++] Converter::Extend gets stuck in infinite loop causing OOM if 
values don't fit in single chunk  (was: Very large memory consumption when 
building a table)

> [C++] Converter::Extend gets stuck in infinite loop causing OOM if values 
> don't fit in single chunk
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12983
>                 URL: https://issues.apache.org/jira/browse/ARROW-12983
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 4.0.0, 4.0.1
>            Reporter: Laurent Mazare
>            Priority: Major
>
> _Apologies if this is a duplicate; I haven't found anything related._
> When creating an Arrow table via the Python API, the following code runs out 
> of memory, exhausting all available resources on a box with 512 GB of RAM. 
> This happens with pyarrow 4.0.0 and 4.0.1. With pyarrow 3.0.0, however, the 
> same code peaks at about 5 GB of memory, which is the right ballpark for the 
> table size (roughly 3 GB of string data).
>  The code generates a table with a single string column of 1M rows, each 
> string being 3,000 characters long.
> I'm not sure whether the issue is Python-specific; I haven't tried 
> reproducing it from the C++ API.
>  
> {code:python}
> import string
> import numpy as np
> import pyarrow as pa
> print(pa.__version__)
> np.random.seed(42)
> alphabet = list(string.ascii_uppercase)
> # Build 1,000 distinct 3,000-character strings, each repeated 1,000
> # times: 1M rows and roughly 3 GB of string data in total.
> _col = []
> for _n in range(1000):
>     k = ''.join(np.random.choice(alphabet, 3000))
>     _col += [k] * 1000
> table = pa.Table.from_pydict({'col': _col})
> {code}
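A possible workaround, sketched below under untested assumptions (the slice size is arbitrary, and the effect on the affected versions has not been verified): convert the column in explicit slices and assemble a ChunkedArray, so that no single conversion call has to fit all of the string data into one chunk. The example uses a scaled-down column for illustration.

{code:python}
import string

import numpy as np
import pyarrow as pa

np.random.seed(42)
alphabet = list(string.ascii_uppercase)

# Scaled-down column for illustration: 10 distinct 3,000-character
# strings, each repeated 10 times (the repro uses 1,000 x 1,000).
col = []
for _ in range(10):
    k = ''.join(np.random.choice(alphabet, 3000))
    col += [k] * 10

# Convert slice by slice so each pa.array() call only sees a bounded
# amount of data, then stitch the pieces into one chunked column.
chunk_size = 25
chunks = [pa.array(col[i:i + chunk_size], type=pa.string())
          for i in range(0, len(col), chunk_size)]
table = pa.Table.from_arrays([pa.chunked_array(chunks)], names=['col'])
print(table.num_rows, table.column('col').num_chunks)
{code}

Using pa.large_string() (64-bit offsets) instead of pa.string() might also sidestep the chunking path, since a single large_string chunk can hold more than 2 GB of character data, but that has not been verified against this bug.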



--
This message was sent by Atlassian Jira
(v8.3.4#803005)