Re: [PR] GH-49946: [Format] Better document equivalence between IPC file and streams [arrow]

via GitHub Thu, 07 May 2026 10:46:53 -0700


paleolimbot commented on code in PR #49947:
URL: https://github.com/apache/arrow/pull/49947#discussion_r3203473469



##########
docs/source/format/Columnar.rst:
##########
@@ -1333,22 +1334,21 @@ The flattened version of this is: ::
 For the buffers produced, we would have the following (refer to the
 table above): ::
 
-    buffer 0: field 0 validity
-    buffer 1: field 1 validity
-    buffer 2: field 1 values
-    buffer 3: field 2 validity
-    buffer 4: field 2 offsets
-    buffer 5: field 3 validity
-    buffer 6: field 3 values
-    buffer 7: field 4 validity
-    buffer 8: field 4 values
-    buffer 9: field 5 validity
-    buffer 10: field 5 offsets
-    buffer 11: field 5 data
-
-The ``Buffer`` Flatbuffers value describes the location and size of a
-piece of memory. Generally these are interpreted relative to the
-**encapsulated message format** defined below.
+    buffer 0: field 0 ('col1') validity
+    buffer 1: field 1 ('col1.a') validity
+    buffer 2: field 1 ('col1.a') values
+    buffer 3: field 2 ('col1.b') validity
+    buffer 4: field 2 ('col1.b') offsets
+    buffer 5: field 3 ('col1.b.item') validity
+    buffer 6: field 3 ('col1.b.item') values
+    buffer 7: field 4 ('col1.c') validity
+    buffer 8: field 4 ('col1.c') values
+    buffer 9: field 5 ('col2') validity
+    buffer 10: field 5 ('col2') offsets
+    buffer 11: field 5 ('col2') data
+
+The ``Buffer`` Flatbuffers value describes the location and size of a buffer's
+data, relatively to the start of the RecordBatch message's body.

Review Comment:
   I believe the offsets can be global in the dissociated IPC protocol 
(metadata and bodies sent on separate streams), although I forget if that was 
ever actually implemented anywhere.
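
   For reference, a minimal sketch of the body-relative interpretation that the
   new wording describes. The `Buffer` tuple mirrors the `offset`/`length`
   fields of the Flatbuffers struct; `resolve_buffers` and its bounds check are
   hypothetical names for illustration, not an Arrow API, and the dissociated
   case with global offsets is not covered.

```python
from typing import List, NamedTuple


class Buffer(NamedTuple):
    """Illustrative stand-in for the Flatbuffers Buffer struct."""
    offset: int  # byte offset, relative to the start of the message body
    length: int  # size of the buffer's data in bytes


def resolve_buffers(body: bytes, buffers: List[Buffer]) -> List[memoryview]:
    """Slice each buffer out of a RecordBatch message body (hypothetical helper).

    Offsets are taken relative to the start of `body`, not the file or stream.
    """
    view = memoryview(body)
    resolved = []
    for index, buf in enumerate(buffers):
        end = buf.offset + buf.length
        if buf.offset < 0 or buf.length < 0 or end > len(view):
            raise ValueError(f"buffer {index} lies outside the message body")
        # Zero-copy slice of the body for this buffer.
        resolved.append(view[buf.offset:end])
    return resolved
```

   A zero-length buffer (e.g. an omitted validity bitmap) resolves to an empty
   view under this sketch.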



##########
docs/source/format/Columnar.rst:
##########
@@ -1524,21 +1529,46 @@ Schematically we have: ::
     <empty padding bytes [to 8 byte boundary]>
     <STREAMING FORMAT with EOS>
     <FOOTER>
-    <FOOTER SIZE: int32>
+    <FOOTER SIZE: little-endian int32>
     <magic number "ARROW1">
 
-In the file format, there is no requirement that dictionary keys
-should be defined in a ``DictionaryBatch`` before they are used in a
-``RecordBatch``, as long as the keys are defined somewhere in the
-file. Further more, it is invalid to have more than one **non-delta**
-dictionary batch per dictionary ID (i.e. dictionary replacement is not
-supported). Delta dictionaries are applied in the order they appear in
-the file footer. We recommend the ".arrow" extension for files created with
-this format. Note that files created with this format are sometimes called
-"Feather V2" or with the ".feather" extension, the name and the extension
-derived from "Feather (V1)", which was a proof of concept early in
-the Arrow project for language-agnostic fast data frame storage for
-Python (pandas) and R.
+Equivalence with the IPC Streaming Format
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+While it is theoretically possible for the IPC File footer to list RecordBatch
+messages in a differing order from the embedded IPC Stream's sequential order
+(or even to repeat or omit some of the IPC Stream's RecordBatch messages),
+compliant writers SHOULD arrange the IPC File footer so that an IPC File can be
+read using an IPC Stream reader with equivalent results.

Review Comment:
   It may be nice at some point to indicate in the "features" section of a 
flatbuffers Schema that the file can definitely be read as an IPC stream 
(i.e., that the result doesn't differ from what one would get by reading via 
the blocks in the footer). The fact that nanoarrow does this blindly is not 
great and I'll fix it, but it is a cool feature that you can do full scans 
without random access in most cases.
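
   To make the contrast concrete, a rough sketch of the random-access side,
   based only on the trailer layout in the hunk above (footer, little-endian
   int32 footer size, then the "ARROW1" magic). `locate_footer` and
   `footer_matches_stream` are hypothetical helpers, not nanoarrow or Arrow
   APIs, and the offset lists are assumed to come from a Flatbuffers parser
   that is not shown here.

```python
import struct
from typing import Sequence, Tuple

MAGIC = b"ARROW1"


def locate_footer(data: bytes) -> Tuple[int, int]:
    """Return (footer_start, footer_size) for an IPC file held in memory.

    This needs the end of the file, which is exactly the random access a
    pure stream reader avoids.
    """
    trailer_len = 4 + len(MAGIC)  # int32 footer size + trailing magic
    if len(data) < trailer_len or not data.endswith(MAGIC):
        raise ValueError("missing trailing ARROW1 magic")
    (footer_size,) = struct.unpack_from("<i", data, len(data) - trailer_len)
    footer_start = len(data) - trailer_len - footer_size
    if footer_size < 0 or footer_start < 0:
        raise ValueError("invalid footer size")
    return footer_start, footer_size


def footer_matches_stream(footer_offsets: Sequence[int],
                          stream_offsets: Sequence[int]) -> bool:
    """True if the footer lists the same RecordBatch messages, in the same
    order, as a sequential scan of the embedded stream encountered them."""
    return list(footer_offsets) == list(stream_offsets)
```

   A reader could run a check along these lines before choosing to scan the
   file as a stream instead of assuming equivalence.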



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
