[jira] [Commented] (ARROW-2367) [Python] ListArray has trouble with sizes greater than kMaximumCapacity

Wes McKinney (JIRA) Mon, 11 Feb 2019 19:29:27 -0800


    [ https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765637#comment-16765637 ]


Wes McKinney commented on ARROW-2367:
-------------------------------------

Oof, ok, this is complicated. 

We need to return a chunked array in the example above, where each chunk is a 
{{ListArray}}. The serious problem here is that a binary overflow can occur _in 
the middle of a list element_. 

So we need to determine if appending the list values (e.g. an array slot's list 
of strings) is going to overflow the builder, and in that case start the next 
array chunk. 
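
A minimal sketch of that check, in plain Python rather than the actual C++ 
builder code: sum the bytes of a whole list element first, and start a new 
chunk before appending if the element would not fit. The names 
chunk_list_rows and MAX_BINARY_BYTES are hypothetical; the limit is taken 
from the error message quoted in the issue, and nulls are not handled. 

```python
from typing import Iterable, List, Sequence

# Assumption: the byte limit quoted in this issue's error message.
MAX_BINARY_BYTES = 2147483646


def chunk_list_rows(rows: Iterable[Sequence[bytes]],
                    max_bytes: int = MAX_BINARY_BYTES
                    ) -> List[List[Sequence[bytes]]]:
    """Group rows into chunks whose total value bytes stay within max_bytes.

    A row (one list element) is never split across two chunks.
    """
    chunks: List[List[Sequence[bytes]]] = []
    current: List[Sequence[bytes]] = []
    current_bytes = 0
    for row in rows:
        # Size of the whole list element, computed before anything is appended.
        row_bytes = sum(len(value) for value in row)
        if row_bytes > max_bytes:
            raise ValueError("a single list element exceeds the chunk capacity")
        # Decide up front, so an overflow cannot occur in the middle of a row.
        if current and current_bytes + row_bytes > max_bytes:
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(row)
        current_bytes += row_bytes
    if current:
        chunks.append(current)
    return chunks
```

For the data in the issue, the first 21 rows total exactly 2147483646 bytes 
(20 * 100000000 + 147483646), so under this sketch the final one-byte row 
would start a second chunk. 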

This is made more complicated by the SeqConverter initialization pattern in 
python_to_arrow.cc. 

I'll leave this issue in 0.13, but it feels like a solid 2-3 hours of 
refactoring to get things sorted out.

> [Python] ListArray has trouble with sizes greater than kMaximumCapacity
> -----------------------------------------------------------------------
>
>                 Key: ARROW-2367
>                 URL: https://issues.apache.org/jira/browse/ARROW-2367
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>            Reporter: Bryant Menn
>            Assignee: Wes McKinney
>            Priority: Major
>             Fix For: 0.13.0
>
>
> When a pandas DataFrame has a column whose elements are lists, the 
> following error occurs when converting to a {{pyarrow.Table}} object.
> {code}
> Traceback (most recent call last):
> File "arrow-2227.py", line 16, in <module>
> arr = pa.array(df['strings'], from_pandas=True)
> File "array.pxi", line 177, in pyarrow.lib.array
> File "error.pxi", line 77, in pyarrow.lib.check_status
> File "error.pxi", line 77, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 
> bytes, have 2147483647
> {code}
> The following code was used to generate the error (adapted from ARROW-2227):
> {code}
> import pandas as pd
> import pyarrow as pa
> # Commented lines were used to test non-binary data types; both cause the
> # same error
> v1 = b'x' * 100000000
> v2 = b'x' * 147483646
> # v1 = 'x' * 100000000
> # v2 = 'x' * 147483646
> df = pd.DataFrame({
>      'strings': [[v1]] * 20 + [[v2]] + [[b'x']]
>      # 'strings': [[v1]] * 20 + [[v2]] + [['x']]
> })
> arr = pa.array(df['strings'], from_pandas=True)
> assert isinstance(arr, pa.ChunkedArray), type(arr)
> {code}
> Code was run using Python 3.6 with PyArrow installed from conda-forge on 
> macOS High Sierra.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
