[
https://issues.apache.org/jira/browse/ARROW-12983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358632#comment-17358632
]
David Li commented on ARROW-12983:
----------------------------------
The Reserve method is another possible bug that needs a workaround, but it
isn't relevant here: the binary builder's Reserve is only an estimate (it
doesn't know the string lengths in advance), so the allocation error only
surfaces during the actual append. (I confirmed this by adding some logging to
the allocator - you can watch it reallocate the data buffer until it hits the
maximum length, at which point it fails.) So items really are appended here.
The case you point out should be fixed too, though.
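For scale: a plain Arrow `string` array addresses its character data with
32-bit offsets, so a single chunk holds at most 2 GiB of string bytes, while
the reproducer in the quoted issue builds well over that. A quick
back-of-the-envelope check (an illustration using the issue's numbers, not
Arrow internals):

```python
# Rough capacity check: a `string` array stores character data behind
# int32 offsets, so one chunk can hold at most 2**31 - 1 bytes.
rows = 1000 * 1000           # strings built by the reproducer
bytes_per_string = 3000      # length of each string
total_bytes = rows * bytes_per_string

single_chunk_limit = 2**31 - 1   # max addressable by int32 offsets

print(total_bytes)                       # 3000000000
print(total_bytes > single_chunk_limit)  # True: the data cannot fit in one chunk
```

This is why Converter::Extend must split the column into multiple chunks, and
why a loop that never makes progress past the chunk boundary keeps
reallocating until memory is exhausted.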
> [C++] Converter::Extend gets stuck in infinite loop causing OOM if values
> don't fit in single chunk
> ---------------------------------------------------------------------------------------------------
>
> Key: ARROW-12983
> URL: https://issues.apache.org/jira/browse/ARROW-12983
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 4.0.0, 4.0.1
> Reporter: Laurent Mazare
> Assignee: David Li
> Priority: Major
>
> _Apologies if this is a duplicate, I haven't found anything related_
> When creating an Arrow table via the Python API, the following code runs out
> of memory after using all the available resources on a box with 512 GB of RAM.
> This happens with pyarrow 4.0.0 and 4.0.1. However, when running the same code
> with pyarrow 3.0.0, the memory usage only reaches 5 GB (which seems like the
> appropriate ballpark for the table size).
> The code generates a table with a single string column with 1M rows, each
> string being 3000 characters long.
> I'm not sure whether the issue is Python-related or not; I haven't tried
> replicating it from the C++ API.
>
> {code:python}
> import os, string
> import numpy as np
> import pyarrow as pa
>
> print(pa.__version__)
> np.random.seed(42)
> alphabet = list(string.ascii_uppercase)
> _col = []
> for _n in range(1000):
>     k = ''.join(np.random.choice(alphabet, 3000))
>     _col += [k] * 1000
> table = pa.Table.from_pydict({'col': _col})
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)