jorisvandenbossche commented on issue #41469:
URL: https://github.com/apache/arrow/issues/41469#issuecomment-2090247321
Thanks, I can now reproduce it as well!

I think you are right in the observation that this seems to be a problem with the
data generated on the Java/Spark side (although it is still strange that whether it
segfaults or not depends on whether numpy is imported first).
When reading your IPC stream file without converting to pandas and
inspecting the data directly, we can see that it is indeed invalid:
```python
import pyarrow as pa
with pa.ipc.open_stream("../Downloads/arrow_stream.txt") as reader:
    batch = reader.read_next_batch()
    arr = batch["value"]
>>> arr
<pyarrow.lib.ListArray object at 0x7fe352a8d1e0>
[
null,
null,
null
]
# Validating / inspecting the parent array
>>> arr.validate(full=True)
>>> arr.offsets
<pyarrow.lib.Int32Array object at 0x7fe29daa0460>
[
0,
0,
0,
0
]
>>> arr.values
<pyarrow.lib.ListArray object at 0x7fe29d84c0a0>
[]
# Validating / inspecting the first child array
>>> arr.values.validate(full=True)
>>> arr.values.offsets
<pyarrow.lib.Int32Array object at 0x7fe29f3ef880>
<Invalid array: Buffer #1 too small in array of type int32 and length 1:
expected at least 4 byte(s), got 0>
>>> arr.values.values
<pyarrow.lib.ListArray object at 0x7fe29f238760>
[]
```
So the offsets of the child array are missing. This child array has a length
of 0, but following the Arrow format, its offsets buffer still needs to contain
one entry (a single 0, i.e. at least 4 bytes for int32 offsets).
I seem to remember this is a case that has come up before.
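To make that concrete, here is a minimal sketch (with hypothetical data, not your stream) of what the offsets of an empty list array look like when built by pyarrow, plus a reconstruction of the broken layout via `Array.from_buffers` (assuming, as the output above suggests, that buffer sizes are not checked at construction time):

```python
import pyarrow as pa

typ = pa.list_(pa.int64())
values = pa.array([], type=pa.int64())

# A well-formed list array of length 0: the int32 offsets buffer still
# holds one entry (a single 0), i.e. at least 4 bytes.
valid = pa.ListArray.from_arrays(pa.array([0], type=pa.int32()), values)
valid.validate(full=True)   # passes
print(valid.offsets)        # -> [0]

# Rebuilding the same length-0 array but with a zero-byte offsets buffer,
# which is what the Java/Spark writer appears to produce here.
invalid = pa.Array.from_buffers(typ, 0, [None, pa.py_buffer(b"")],
                                children=[values])
# Inspecting the offsets should then show the same kind of
# "Buffer #1 too small ... expected at least 4 byte(s), got 0" error as above.
print(invalid.offsets)
```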