quentin lhoest created ARROW-15837:
--------------------------------------
Summary: [Python] ListArray.offsets is wrong when it contains both
lists and null values
Key: ARROW-15837
URL: https://issues.apache.org/jira/browse/ARROW-15837
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 7.0.0
Reporter: quentin lhoest
Hi! I noticed this bug when running the following code:
{code:python}
import pyarrow as pa
arr = pa.array([None, [0]])
reconstructed_arr = pa.ListArray.from_arrays(arr.offsets, arr.values)
print(reconstructed_arr.to_pylist())
# [[], [0]] {code}
The resulting array, reconstructed from the offsets and values of the original
array, {*}is not the same as the original array{*}.
This seems to happen because `arr.offsets` is wrong: it returns
`[0, 0, 1]` instead of `[None, 0, 1]`:
{code:python}
print(arr.offsets.to_pylist())
# [0, 0, 1]
fixed_reconstructed_arr = pa.ListArray.from_arrays(pa.array([None, 0, 1]),
                                                   arr.values)
print(fixed_reconstructed_arr.to_pylist())
# [None, [0]]{code}
If it can help, here is my investigation:
The offsets seem to be wrong because they don't include the validity bitmap
from `arr.buffers()[0]`, which is used to say which values are null and
which values are non-null. Therefore the `None` is replaced by `0`.
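
To illustrate what I mean, here is a minimal pure-Python sketch of how a list array could be decoded from its parts. It does not use pyarrow at all: `decode_list_array` and its `validity` argument are hypothetical names I made up for this illustration, and the model is deliberately simplified.

```python
def decode_list_array(offsets, values, validity=None):
    """Hypothetical helper: rebuild Python lists from list-array style parts.

    offsets  -- n + 1 integers delimiting each list inside `values`
    values   -- the flat child values
    validity -- optional list of n booleans (True = non-null);
                None means every slot is treated as valid
    """
    result = []
    for i in range(len(offsets) - 1):
        if validity is not None and not validity[i]:
            # A null slot: the offsets are ignored for this entry.
            result.append(None)
        else:
            result.append(values[offsets[i]:offsets[i + 1]])
    return result


offsets = [0, 0, 1]
values = [0]

# Offsets alone cannot tell a null apart from an empty list.
without_validity = decode_list_array(offsets, values)
# With the validity information, the null is preserved.
with_validity = decode_list_array(offsets, values, validity=[False, True])

print(without_validity)  # [[], [0]]
print(with_validity)     # [None, [0]]
```

With the offsets `[0, 0, 1]` alone, the first slot decodes to an empty list, which matches the `[[], [0]]` I get above.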
Separately, even leaving aside that the validity bitmap is not applied to the
offsets, its value was not what I expected: the validity bitmap at
`arr.buffers()[0]` is supposed to be `110` (in order to mask the None
in `[None, 0, 1]`), but it is `10` for some reason:
{code:python}
bin(int(arr.buffers()[0].hex(), 16))
# '0b10'
# I think it should be 0b110: 1 corresponds to non-null and 0 corresponds to
# null, if you take the bits in reverse order
{code}
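
To make the "reverse order" reading concrete, here is a small sketch that expands a bitmap into one boolean per slot. It assumes the bits are read least-significant first within each byte, as the Arrow columnar format describes. `validity_bits` is a hypothetical helper, not a pyarrow function; with pyarrow, the input bytes could presumably be obtained with `arr.buffers()[0].to_pybytes()`.

```python
def validity_bits(bitmap: bytes, length: int):
    """Return one boolean per slot (True = non-null), reading each byte
    least-significant bit first. Hypothetical helper for illustration."""
    return [bool((bitmap[i // 8] >> (i % 8)) & 1) for i in range(length)]


# The value I observed, read against two slots.
two_slots = validity_bits(bytes([0b10]), 2)
# The value I expected, read against three slots.
three_slots = validity_bits(bytes([0b110]), 3)

print(two_slots)    # [False, True]
print(three_slots)  # [False, True, True]
```

Read this way, `0b10` covers two slots (null, then non-null), while `0b110` would cover three slots, one per entry of `[None, 0, 1]`.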
--
This message was sent by Atlassian Jira
(v8.20.1#820001)