quentin lhoest created ARROW-15837:
--------------------------------------

             Summary: [Python] ListArray.offsets is wrong when it contains both 
lists and null values
                 Key: ARROW-15837
                 URL: https://issues.apache.org/jira/browse/ARROW-15837
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 7.0.0
            Reporter: quentin lhoest


Hi! I noticed this bug by running this code:
{code:python}
import pyarrow as pa

arr = pa.array([None, [0]])
reconstructed_arr = pa.ListArray.from_arrays(arr.offsets, arr.values)
print(reconstructed_arr.to_pylist())
# [[], [0]] {code}
The resulting array, reconstructed from the offsets and values of the original 
array, {*}is not the same as the original array{*}.

This seems to happen because `arr.offsets` is wrong: it returns `[0, 0, 1]` 
instead of `[None, 0, 1]`:
{code:python}
print(arr.offsets.to_pylist())
# [0, 0, 1]

fixed_reconstructed_arr = pa.ListArray.from_arrays(pa.array([None, 0, 1]), arr.values)
print(fixed_reconstructed_arr.to_pylist())
# [None, [0]]{code}
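
To make the round-trip difference concrete, here is a minimal pure-Python model of how a list array can be decoded from its offsets and flat values, with and without per-entry validity flags. `decode_list_array` is a hypothetical helper written only for illustration. It is not a pyarrow function, and it ignores slicing and offset types.

```python
from typing import Any, List, Optional, Sequence


def decode_list_array(
    offsets: Sequence[int],
    values: Sequence[Any],
    validity: Optional[Sequence[bool]] = None,
) -> List[Optional[list]]:
    """Hypothetical model: rebuild Python lists from offsets and flat values.

    Entry i spans values[offsets[i]:offsets[i + 1]]. If validity is given,
    a False flag marks entry i as null regardless of its offsets.
    """
    result: List[Optional[list]] = []
    for i in range(len(offsets) - 1):
        if validity is not None and not validity[i]:
            result.append(None)
        else:
            result.append(list(values[offsets[i]:offsets[i + 1]]))
    return result


# Without validity flags, the first entry is an empty list.
print(decode_list_array([0, 0, 1], [0]))
# With validity flags, the first entry is null.
print(decode_list_array([0, 0, 1], [0], validity=[False, True]))
```

In this model the offsets `[0, 0, 1]` alone cannot distinguish an empty list from a null entry. Only the extra validity information does.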
If it can help, here is my investigation:

The offsets seem to be wrong because they don't include the validity bitmap 
from `arr.buffers()[0]`, which is used to say which values are null and 
which values are non-null. Therefore the `None` is replaced by `0`.

Even setting aside that the validity bitmap is not taken into account, I checked 
its value and it was not what I expected: the validity bitmap at 
`arr.buffers()[0]` is supposed to be `110` (in order to mask the None 
in `[None, 0, 1]`) but it is `10` for some reason:
{code:python}
bin(int(arr.buffers()[0].hex(), 16))
# '0b10'
# I think it should be 0b110 - 1 corresponds to non-null and 0 corresponds to
# null, if you take the bits in reverse order {code}
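
For reading the bitmap bit by bit, a small sketch may help. It assumes the bit ordering described in the Arrow columnar format (least-significant bit first within each byte) and one bit per list entry rather than one per offset. `validity_bits` is a hypothetical helper, not part of pyarrow.

```python
from typing import List


def validity_bits(bitmap: bytes, length: int) -> List[bool]:
    """Hypothetical helper: one bool per entry, least-significant bit first."""
    return [bool((bitmap[i // 8] >> (i % 8)) & 1) for i in range(length)]


# The byte 0x02 is 0b00000010: bit 0 is unset (null), bit 1 is set (valid).
print(validity_bits(bytes.fromhex("02"), 2))
```

Under those assumptions a two-entry array needs only two bits, so `0b10` would read as null followed by valid.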



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
