jorisvandenbossche commented on issue #41469:
URL: https://github.com/apache/arrow/issues/41469#issuecomment-2090247321
Thanks, I can now reproduce it as well!

I think you are right in the observation that this seems to be a problem with the
data generated on the Java/Spark side (although it is still strange that whether it
segfaults or not depends on whether numpy is imported first).
When reading your IPC stream file without converting to pandas and
inspecting the data directly, we can see that it is indeed invalid:
```python
import pyarrow as pa
with pa.ipc.open_stream("../Downloads/arrow_stream.txt") as reader:
    batch = reader.read_next_batch()
    arr = batch["value"]
>>> arr
<pyarrow.lib.ListArray object at 0x7fe352a8d1e0>
[
null,
null,
null
]
# Validating / inspecting the parent array
>>> arr.validate(full=True)
>>> arr.offsets
<pyarrow.lib.Int32Array object at 0x7fe29daa0460>
[
0,
0,
0,
0
]
>>> arr.values
<pyarrow.lib.ListArray object at 0x7fe29d84c0a0>
[]
# Validating / inspecting the first child array
>>> arr.values.validate(full=True)
>>> arr.values.offsets
<pyarrow.lib.Int32Array object at 0x7fe29f3ef880>
<Invalid array: Buffer #1 too small in array of type int32 and length 1:
expected at least 4 byte(s), got 0>
>>> arr.values.values
<pyarrow.lib.ListArray object at 0x7fe29f238760>
[]
```
So the offsets of the child array are missing. This child array has a length
of 0, but following the Arrow format, its offsets buffer still needs to contain
one entry (a single 0, i.e. at least 4 bytes for int32 offsets).
I seem to remember this is a case that has come up before.
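To make that concrete, here is a minimal sketch (with hypothetical data, not your stream) of what the offsets of an empty list array look like when built by pyarrow, plus a reconstruction of the broken layout via `Array.from_buffers` (assuming, as the output above suggests, that buffer sizes are not checked at construction time):

```python
import pyarrow as pa

typ = pa.list_(pa.int64())
values = pa.array([], type=pa.int64())

# A well-formed list array of length 0: the int32 offsets buffer still
# holds one entry (a single 0), i.e. at least 4 bytes.
valid = pa.ListArray.from_arrays(pa.array([0], type=pa.int32()), values)
valid.validate(full=True)   # passes
print(valid.offsets)        # -> [0]

# Rebuilding the same length-0 array but with a zero-byte offsets buffer,
# which is what the Java/Spark writer appears to produce here.
invalid = pa.Array.from_buffers(typ, 0, [None, pa.py_buffer(b"")],
                                children=[values])
# Inspecting the offsets should then show the same kind of
# "Buffer #1 too small ... expected at least 4 byte(s), got 0" error as above.
print(invalid.offsets)
```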