bkietz opened a new issue, #6311:
URL: https://github.com/apache/arrow-rs/issues/6311

   **Describe the bug**
   
   The format guarantees that each IPC file [embeds a valid IPC 
stream](https://github.com/apache/arrow/blob/bde6ac57ad943f5938506359de1b13fb85b4f8ea/docs/source/format/Columnar.rst)
 in order to allow readers to ignore the Footer, skip the file's leading magic, 
and reuse a stream reader.
   
   However, when writing IPC files, arrow-rs aligns the encapsulated flatbuffers 
Messages to 64-byte boundaries instead of 8-byte boundaries. This can leave gaps of 
padding bytes between the Messages that a stream reader would not know to skip.
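   
To make the size of such a gap concrete, here is a minimal sketch of the alignment arithmetic. It assumes the writer rounds the position right after the 8-byte leading magic up to the chosen alignment; `pad_to` and `MAGIC_LEN` are hypothetical names used for illustration, not arrow-rs APIs.

```python
MAGIC_LEN = 8  # "ARROW1" plus two padding bytes, per the 8-byte skip described in this issue


def pad_to(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (hypothetical helper)."""
    return (offset + alignment - 1) // alignment * alignment


for alignment in (8, 64):
    start = pad_to(MAGIC_LEN, alignment)
    # The gap is the run of padding bytes a stream reader would have to skip.
    print(f"alignment={alignment}: first Message at {start}, gap={start - MAGIC_LEN}")
```

With 8-byte alignment the first Message can start immediately at offset 8; with 64-byte alignment it would start at offset 64, leaving 56 bytes that a stream reader has no way to know about.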
   
   **To Reproduce**
   https://github.com/apache/arrow/pull/43834 adds validation of the embedded 
stream to arrow-c++'s `arrow-json-integration-test`. Running this against IPC 
files written by arrow-rs [raises an 
error](https://github.com/apache/arrow/actions/runs/10565413633/job/29269867382?pr=43834#step:9:27320):
   
   ```shell-session
   $ arrow-json-integration-test -arrow datetime.arrow -json datetime.json -integration -mode VALIDATE
   Error message: Invalid: Tried reading schema message, was null or length 0
   ```
   
   **Expected behavior**
   
   A stream reader should be able to read an IPC file by skipping the first 8 
bytes.
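   
The expectation above can be sketched with a toy reader. This is not the Arrow C++ or arrow-rs reader; it assumes the current encapsulated message framing (a 32-bit `0xFFFFFFFF` continuation marker followed by a 32-bit little-endian metadata length), and the metadata bytes are dummy data rather than a real flatbuffer. `first_message_length`, `tight`, and `padded` are hypothetical names.

```python
import io
import struct

MAGIC = b"ARROW1\x00\x00"  # 6-byte magic padded to 8 bytes
CONTINUATION = 0xFFFFFFFF


def first_message_length(data: bytes) -> int:
    """Skip the leading magic, then read the first message prefix as a stream reader would."""
    stream = io.BytesIO(data)
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("missing leading magic")
    prefix = stream.read(8)
    if len(prefix) < 8:
        raise ValueError("truncated message prefix")
    marker, length = struct.unpack("<Ii", prefix)
    if marker != CONTINUATION or length <= 0:
        raise ValueError(f"no message at offset 8 (marker={marker:#x}, length={length})")
    return length


fake_metadata = b"\x01" * 16  # stand-in for a flatbuffers Message
message = struct.pack("<Ii", CONTINUATION, len(fake_metadata)) + fake_metadata

tight = MAGIC + message                  # Message follows the magic directly
padded = MAGIC + b"\x00" * 56 + message  # Message pushed out to offset 64

print(first_message_length(tight))
try:
    first_message_length(padded)
except ValueError as err:
    print("padded layout rejected:", err)
```

In the padded layout the reader finds zero bytes where it expects the continuation marker and length, which is consistent with the "was null or length 0" error reported above.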
   
   **Additional context**
   
   This was originally introduced in 
https://github.com/apache/arrow-rs/commit/eddef43d1cb46c1287da187ea1d86b0e1dc35a13
 which added alignment to address new requirements around `i128`. However, the 
alignment should not be applied to flatbuffers Messages; apart from the above 
issue, I think there's no SIMD or other advantage to aligning those to more than 
8 bytes. Body buffers can of course still be padded and aligned freely.
   
   This was discovered while adding IPC file reading to nanoarrow; we were 
trying to defer reading Footers for a follow-up and discovered that the Go, 
Rust, and JavaScript implementations don't embed a valid stream. Most readers 
have not noticed because offsets and schemas are more efficiently read from a 
Footer and, once acquired, obviate sequential stream-style reading.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
