jorisvandenbossche commented on issue #41833:
URL: https://github.com/apache/arrow/issues/41833#issuecomment-2141586788
@timsaucer I see what you mean, but as far as I know, nothing in the Arrow
columnar format specification requires that those values are null.
After all, even for a primitive array with a null, we actually put some
"default" value in the null slot:
```python
>>> arr = pa.array([1, None, 3])
>>> arr
<pyarrow.lib.Int64Array object at 0x7f23782d5360>
[
1,
null,
3
]
# using nanoarrow to more easily view the actual buffers
>>> import nanoarrow as na
>>> na.array(arr).inspect()
<ArrowArray int64>
- length: 3
- offset: 0
- null_count: 1
- buffers[2]:
- validity <bool[1 b] 10100000>
- data <int64[24 b] 1 0 3>  # <-- in the actual data buffer, the null slot is filled with 0
- dictionary: NULL
- children[0]:
```
Similarly, in the nested struct case, those default values in the child
array are masked by the validity of the parent struct array.
I know it is not exactly the same, given that I am comparing a buffer with a
child array, but the principle is the same: nullness is determined by the
validity bitmap, and at that point the underlying value (whether a buffer slot
or a child array's slot) can be any value.
While you could argue that for this specific conversion of Python objects to
Arrow data we _could_ put a null in the child array as well (although that
would require allocating an additional validity bitmap in this small example),
other code should never assume this is the case, because you can easily create
a StructArray in a different way (e.g. directly from the child arrays and a
validity bitmap) that does not give this guarantee.