[
https://issues.apache.org/jira/browse/ARROW-10172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220623#comment-17220623
]
Artem KOZHEVNIKOV commented on ARROW-10172:
-------------------------------------------
By the way, in pyarrow 2 the behaviour has changed:
{code:python}
type(pa.array(['a' * 128] * 10**8))
pyarrow.lib.ChunkedArray # before it was pyarrow.lib.StringArray
type(pa.array(['a' * 128] * 10**8, pa.large_string()))
pyarrow.lib.LargeStringArray
{code}
Calling pa.concat_arrays on the resulting chunks fails with an offset overflow:
{code:python}
str_array = pa.array(['a' * 128] * 10**8)
pa.concat_arrays(str_array.chunks)
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
<ipython-input-49-ec7499df74ad> in <module>
----> 1 pa.concat_arrays(str_array.chunks)
/opt/conda/envs/model/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.concat_arrays()
/opt/conda/envs/model/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/opt/conda/envs/model/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: offset overflow while concatenating arrays
{code}
Upcasting to large string does not work either, so it looks like we cannot
handle such large arrays correctly as of now.
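
For context, a rough sketch of the arithmetic behind the overflow. It relies on the fact that a StringArray addresses its character data with 32-bit signed offsets, while a LargeStringArray uses 64-bit offsets. The helper names below are illustrative only and are not part of pyarrow, and the chunk count is a lower bound from the byte total, not what pyarrow actually produces.

```python
# Largest offset representable by each string layout:
# StringArray uses int32 offsets, LargeStringArray uses int64 offsets.
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def total_string_bytes(n_values, bytes_per_value):
    """Total size of the character data for equally sized ASCII strings."""
    return n_values * bytes_per_value


def fits_in_offsets(total_bytes, max_offset=INT32_MAX):
    """True if the final offset (the total byte count) is representable."""
    return total_bytes <= max_offset


# The array from the example above: 10**8 strings of 128 bytes each.
total = total_string_bytes(10**8, 128)

print(total)                              # 12800000000
print(fits_in_offsets(total))             # False: too big for int32 offsets
print(fits_in_offsets(total, INT64_MAX))  # True: fits with int64 offsets

# Lower bound on the number of int32-offset chunks needed for this data
# (ceiling division of the byte total by the int32 limit).
min_chunks = -(-total // INT32_MAX)
print(min_chunks)                         # 6
```

This is consistent with pa.array returning a ChunkedArray for this input, and with concat_arrays being unable to produce a single StringArray from its chunks.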
> [Python] pyarrow.concat_arrays segfaults if a resulting StringArray's
> capacity overflows
> ----------------------------------------------------------------------------------------
>
> Key: ARROW-10172
> URL: https://issues.apache.org/jira/browse/ARROW-10172
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.1
> Reporter: Artem KOZHEVNIKOV
> Priority: Major
>
> I'm sorry if this was already reported, but there's an overflow issue in
> concatenation of large arrays
> {code:python}
> In [1]: import pyarrow as pa
> In [2]: str_array = pa.array(['a' * 128] * 10**8)
> In [3]: large_array = pa.concat_arrays([str_array] * 50)
> Segmentation fault (core dumped)
> {code}
> I suppose that this should be handled by upcast to large_string.