[ 
https://issues.apache.org/jira/browse/ARROW-10172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220623#comment-17220623
 ] 

Artem KOZHEVNIKOV commented on ARROW-10172:
-------------------------------------------

By the way, in pyarrow 2.0 the behaviour has changed:
{code:python}
import pyarrow as pa

type(pa.array(['a' * 128] * 10**8))
# pyarrow.lib.ChunkedArray (previously it was pyarrow.lib.StringArray)
type(pa.array(['a' * 128] * 10**8, pa.large_string()))
# pyarrow.lib.LargeStringArray
{code}

Calling pa.concat_arrays on the resulting chunks now raises an offset overflow error instead:
{code:python}
str_array = pa.array(['a' * 128] * 10**8)
pa.concat_arrays(str_array.chunks)
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-49-ec7499df74ad> in <module>
----> 1 pa.concat_arrays(str_array.chunks)

/opt/conda/envs/model/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.concat_arrays()

/opt/conda/envs/model/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/opt/conda/envs/model/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: offset overflow while concatenating arrays
{code}
Upcasting to large_string does not work either, so it looks like we cannot 
currently handle such large arrays correctly.
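
For reference, a rough back-of-the-envelope check of why the offsets overflow here. This is only a sketch in plain Python, not pyarrow code: it assumes the plain string type stores its offsets as signed 32-bit integers (large_string uses 64-bit offsets), and the helper names are made up for illustration. The chunk count is a lower bound from the arithmetic alone and is not meant to describe how pyarrow actually splits the data.

```python
# Assumption: the plain `string` type uses signed 32-bit offsets, so a single
# array can address at most 2**31 - 1 bytes of character data.
INT32_MAX = 2**31 - 1


def total_data_bytes(value_len: int, n_values: int, n_copies: int = 1) -> int:
    """Bytes of character data for n_copies arrays of n_values equal-length strings."""
    return value_len * n_values * n_copies


def fits_in_string_array(n_bytes: int) -> bool:
    """True if the data can be addressed with 32-bit offsets."""
    return n_bytes <= INT32_MAX


def min_chunks(value_len: int, n_values: int) -> int:
    """Lower bound on the number of 32-bit-offset chunks needed (hypothetical helper)."""
    per_chunk = INT32_MAX // value_len  # whole strings that fit in one chunk
    return -(-n_values // per_chunk)    # ceiling division


single = total_data_bytes(128, 10**8)                # array from the comment above
repeated = total_data_bytes(128, 10**8, n_copies=50)  # case from the original report

print(single, fits_in_string_array(single))
print(repeated, fits_in_string_array(repeated))
print(min_chunks(128, 10**8))
```

So even the single 10**8-element array is well past what 32-bit offsets can address, which would be consistent with pa.array returning a ChunkedArray and with concat_arrays refusing to merge the chunks back into one StringArray.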


> [Python] pyarrow.concat_arrays segfaults if a resulting StringArray's 
> capacity overflows
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-10172
>                 URL: https://issues.apache.org/jira/browse/ARROW-10172
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Artem KOZHEVNIKOV
>            Priority: Major
>
> I'm sorry if this was already reported, but there's an overflow issue in 
> concatenation of large arrays
> {code:python}
> In [1]: import pyarrow as pa
> In [2]: str_array = pa.array(['a' * 128] * 10**8)
> In [3]: large_array = pa.concat_arrays([str_array] * 50)
> Segmentation fault (core dumped)
> {code}
> I suppose that this should be handled by upcast to large_string.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
