[ 
https://issues.apache.org/jira/browse/ARROW-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16443891#comment-16443891
 ] 

Krisztian Szucs commented on ARROW-2476:
----------------------------------------

Quoting from the spec:

"Any array has a known and fixed length, stored as a 32-bit signed integer, so 
a maximum of 2^31 - 1 elements." 
 Conclusion:
{code}
max_len(array) == 2**31 - 1
{code}
"An offsets buffer containing 32-bit signed integers with length equal to the 
length of the top-level array plus one. Note that this limits the size of the 
values array to 2^31-1" 
 Conclusion:
{code}
len(offsets) == len(array) + 1
max_len(array) == 2**31 - 1 

# => thus
max_len(offsets) == 2**31  
{code}
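To make the relation between array length and offsets length concrete, here is a minimal pure-Python sketch of how an offsets buffer could be derived from a list of lists. The helper name `build_offsets` is hypothetical and is not part of Arrow; it only illustrates the layout rule quoted above.

```python
def build_offsets(lists):
    """Return the offsets for a list-like array: one entry per element, plus one."""
    offsets = [0]
    for item in lists:
        # Each offset is the running total of child values seen so far.
        offsets.append(offsets[-1] + len(item))
    return offsets


lists = [[1, 2], [3], [], [4, 5, 6]]
offsets = build_offsets(lists)

# One more offset than there are top-level elements.
assert len(offsets) == len(lists) + 1
# The last offset equals the total number of child values, so if offsets are
# 32-bit signed integers the values array can hold at most 2**31 - 1 entries.
assert offsets[-1] == sum(len(item) for item in lists)
```

The last offset is what ties the size of the values array to the width of the offset type.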
In 
[builder.h|https://github.com/apache/arrow/blob/master/cpp/src/arrow/builder.h#L44]
{code}
constexpr int64_t kBinaryMemoryLimit = std::numeric_limits<int32_t>::max() - 1;
constexpr int64_t kListMaximumElements = std::numeric_limits<int32_t>::max() - 1;
// where both are actually (2**31 - 1) - 1
{code}
According to Antoine's first comment, the C++ and Python implementations use 
(2^63 - 1) instead of (2^31 - 1) for primitive types, but not for list, string 
and binary types.
 Why does that limitation apply to list-like arrays but not to primitive ones?

As a side note, the length of the offsets buffer can't be represented as a 
32-bit signed integer (for an array with the maximum of (2^31 - 1) elements); 
it requires a 64-bit signed integer, as in the 
[implementation|https://github.com/apache/arrow/blob/master/cpp/src/arrow/buffer.h#L136].
 This also means that, theoretically, a list array with (2^63 - 1) elements 
requires an offsets buffer with up to 2^63 entries, a length only representable 
as an int128 (of course we won't hit that limit).
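The representability argument can be checked with a short sketch. The helper `fits_signed` is a hypothetical name introduced only for illustration; it tests whether a value fits in a two's-complement signed integer of a given width.

```python
def fits_signed(value, bits):
    """Return True if value fits in a signed two's-complement integer of the given width."""
    return -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1


max_array_len_32 = 2**31 - 1               # largest length storable as int32
max_offsets_len_32 = max_array_len_32 + 1  # offsets buffer has one extra entry

max_array_len_64 = 2**63 - 1               # largest length storable as int64
max_offsets_len_64 = max_array_len_64 + 1

assert fits_signed(max_array_len_32, 32)
assert not fits_signed(max_offsets_len_32, 32)  # 2**31 overflows int32
assert fits_signed(max_offsets_len_32, 64)
assert not fits_signed(max_offsets_len_64, 64)  # 2**63 overflows int64
```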

I mean the values are correct (plus/minus one), but the limits (or the 
limitlessness) should be documented somewhere - clearly and consistently.

> [Python/Question] Maximum length of an Array created from ndarray
> -----------------------------------------------------------------
>
>                 Key: ARROW-2476
>                 URL: https://issues.apache.org/jira/browse/ARROW-2476
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Krisztian Szucs
>            Priority: Minor
>
> So the format 
> [describes|https://github.com/apache/arrow/blob/master/format/Layout.md#array-lengths]
>  that the maximum array length is 2^31 - 1; however, the following Python 
> snippet creates an Arrow array of length 2**32:
> {code:python}
> import numpy as np
> import pyarrow as pa
>
> a = np.ones((2**32,), dtype='int8')  # 2**32 one-byte elements
> A = pa.Array.from_pandas(a)
> type(A)
> {code}
> {code}pyarrow.lib.Int8Array{code}
> Based on the layout specification, I'd expect a ChunkedArray of three 
> Int8Arrays with lengths
> [2^31 - 1, 2^31 - 1, 2], or should it raise an exception?
> If that is the expectation, is there any documentation for it?
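The chunk lengths mentioned in the question can be reproduced with a small sketch. The function `chunk_lengths` is hypothetical and does not describe what pyarrow actually does; it only shows how a total length would split under a per-chunk maximum of 2^31 - 1.

```python
def chunk_lengths(total, max_chunk):
    """Split a total length into chunks of at most max_chunk elements."""
    full, rest = divmod(total, max_chunk)
    # `full` chunks of the maximum size, plus one smaller chunk for the remainder.
    return [max_chunk] * full + ([rest] if rest else [])


lengths = chunk_lengths(2**32, 2**31 - 1)
assert sum(lengths) == 2**32
```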



