[
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900029#comment-16900029
]
Antoine Pitrou commented on ARROW-6132:
---------------------------------------
[~wesmckinn] [~xhochy] What do you think? I think there is value in spending a
few CPU cycles validating inputs in the Python APIs.
> [Python] ListArray.from_arrays does not check validity of input arrays
> ----------------------------------------------------------------------
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Joris Van den Bossche
> Priority: Minor
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in Python, the offsets are
> not validated: nothing checks that they start with 0 and end with the length
> of the values array. (Is that required? The docs seem to indicate so:
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
> says "The first value in the offsets array is 0, and the last element is the
> length of the values array.")
> The resulting array "seems" ok (judging by its repr), but converting it to
> Python objects or flattening it goes wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5))
> In [62]: a
> Out[62]:
> <pyarrow.lib.ListArray object at 0x7fdd9c468678>
> [
> [
> 1,
> 2
> ],
> [
> 3,
> 4
> ]
> ]
> In [63]: a.flatten()
> Out[63]:
> <pyarrow.lib.Int64Array object at 0x7fdd9cbfe9e8>
> [
> 0, # <--- includes the 0
> 1,
> 2,
> 3,
> 4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]] # <--- includes more elements as garbage
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe: the caller must either ensure
> that the data is correct or call a safe (slower) constructor. But do we want
> Python to use the unsafe / fast constructors without validation by default
> as well? Or should we call {{validate}} here?
> A quick search seems to indicate that {{pa.Array.from_buffers}} does
> validation, but the other {{from_arrays}} methods don't seem to do this
> explicitly.
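
For illustration, the offset checks discussed in the issue could be sketched
in plain Python as below. This is only a sketch under stated assumptions, not
what pyarrow does internally: the helper name check_list_offsets is
hypothetical, the first-offset and last-offset rules follow the Layout.rst
wording quoted above, and the non-decreasing check is an extra assumption
about what a valid offsets array should satisfy.

```python
from typing import Sequence


def check_list_offsets(offsets: Sequence[int], values_length: int) -> None:
    """Raise ValueError if offsets cannot describe lists over values_length items.

    Hypothetical helper for illustration; it is not part of pyarrow.
    """
    if len(offsets) == 0:
        raise ValueError("offsets must contain at least one element")
    # Rule quoted from Layout.rst: the first offset is 0.
    if offsets[0] != 0:
        raise ValueError(f"first offset must be 0, got {offsets[0]}")
    # Assumption: each list ends at or after the point where it starts.
    for previous, current in zip(offsets, offsets[1:]):
        if current < previous:
            raise ValueError(
                f"offsets must be non-decreasing: {current} < {previous}"
            )
    # Rule quoted from Layout.rst: the last offset is the values length.
    if offsets[-1] != values_length:
        raise ValueError(
            f"final offset must equal values length: "
            f"{offsets[-1]} != {values_length}"
        )


# The offsets from the report are rejected before any array is built.
try:
    check_list_offsets([1, 3, 10], 5)
except ValueError as exc:
    print(exc)

# Offsets that are consistent with five values pass silently.
check_list_offsets([0, 2, 5], 5)
```

The cost is a single pass over the offsets, which is the kind of cheap,
up-front check the comment above argues for. Calling {{validate}} on the
constructed array, as shown in the report, remains the way to use Arrow's own
checks.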