[GitHub] [arrow] zeroshade commented on issue #13875: How does arrow parse its ipc format?

GitBox Mon, 15 Aug 2022 09:36:55 -0700


zeroshade commented on issue #13875:
URL: https://github.com/apache/arrow/issues/13875#issuecomment-1215324205


   Hey @liusitan, the first message in an Arrow IPC stream should always be a 
`Schema` message which describes the actual schema of the following record 
batch messages. So anything processing an IPC `RecordBatch` message should 
already have the expected schema for that record batch. 
   
   Then you have the `nodes` field of the `RecordBatch` flatbuffer message, 
which follows a flattened (depth-first, parent before children) version of the 
logical schema. So for example, if you had the following schema:
   
   ```
   col1: Struct<a: Int32, b: List<item: Int64>, c: Float64>
   col2: Utf8
   ```
   
   The `nodes` list would be:
   
   ```
   FieldNode 0: Struct name='col1'
   FieldNode 1: Int32 name='a'
   FieldNode 2: List name='b'
   FieldNode 3: Int64 name='item'
   FieldNode 4: Float64 name='c'
   FieldNode 5: Utf8 name='col2'
   ```
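
As a rough illustration of that flattening, here is a minimal Python sketch. The `(name, type_name, children)` tuples and the `flatten` helper are hypothetical stand-ins invented for this example, not part of any Arrow API; the point is only the depth-first, parent-before-children ordering.

```python
# Hypothetical mini schema model: (name, type_name, children).
# This is NOT an Arrow API, just a stand-in to show the ordering.
schema = [
    ("col1", "Struct", [
        ("a", "Int32", []),
        ("b", "List", [("item", "Int64", [])]),
        ("c", "Float64", []),
    ]),
    ("col2", "Utf8", []),
]


def flatten(fields):
    """Depth-first, pre-order walk: emit the parent, then its children."""
    nodes = []
    for name, type_name, children in fields:
        nodes.append((type_name, name))
        nodes.extend(flatten(children))
    return nodes


nodes = flatten(schema)
for i, (type_name, name) in enumerate(nodes):
    print(f"FieldNode {i}: {type_name} name='{name}'")
```

Running this prints the same six entries, in the same order, as the list above.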
   
   Each of those `FieldNode` objects contains the metadata describing the 
length of the array and the number of nulls. The `RecordBatch` message also 
has a `buffers` list, which gives the offset and length of each buffer sent in 
the message body. Thus, for the example above, we'd expect the described 
buffers to correspond to:
   
   ```
   buffer 0: field 0 validity bitmap
   buffer 1: field 1 validity bitmap
   buffer 2: field 1 values buffer (ie: a buffer of int32 values)
   buffer 3: field 2 validity bitmap
   buffer 4: field 2 offsets (ie: the offsets buffer for the List array, a buffer of int32 values)
   buffer 5: field 3 validity bitmap
   buffer 6: field 3 values (ie: a buffer of int64 values)
   buffer 7: field 4 validity bitmap
   buffer 8: field 4 values (ie: a buffer of float64 values)
   buffer 9: field 5 validity bitmap
   buffer 10: field 5 offsets (ie: the offsets for col2, the utf8 array, a buffer of int32 values)
   buffer 11: field 5 data (ie: the data buffer for the utf8 column, all of the string values)
   ```
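
The mapping from flattened nodes to buffers can be sketched in Python as below. The `BUFFERS_BY_TYPE` table and `buffer_layout` helper are hypothetical names for this illustration, and the table only covers the types used in this example; it is not a complete description of every Arrow type.

```python
# Which buffers each type contributes, in order. Only the types from the
# example are covered; this table is an illustration, not an Arrow API.
BUFFERS_BY_TYPE = {
    "Struct": ["validity bitmap"],
    "List": ["validity bitmap", "offsets"],
    "Utf8": ["validity bitmap", "offsets", "data"],
    "Int32": ["validity bitmap", "values"],
    "Int64": ["validity bitmap", "values"],
    "Float64": ["validity bitmap", "values"],
}

# Types of the flattened nodes, in FieldNode order.
node_types = ["Struct", "Int32", "List", "Int64", "Float64", "Utf8"]


def buffer_layout(types):
    """Return (field_index, role) for every buffer, in body order."""
    layout = []
    for field_index, type_name in enumerate(types):
        for role in BUFFERS_BY_TYPE[type_name]:
            layout.append((field_index, role))
    return layout


layout = buffer_layout(node_types)
for i, (field_index, role) in enumerate(layout):
    print(f"buffer {i}: field {field_index} {role}")
```

Walking the nodes in order and appending each type's buffers yields the twelve buffers (0 through 11) listed above.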
   
   For your provided example:
   
   ```
   name     | age | balance
   'jack'   | 12  | 100.23
   'Jennie' | 24  | 2000.34
   ```
   
   The first message in the stream would be the schema:
   
   ```
   name: utf8
   age: Int32
   balance: Float64
   ```
   
   The `nodes` in the `RecordBatch` message (the second message of the stream) 
would contain:
   
   ```
   FieldNode 0: Utf8 name='name', length=2, nulls=0
   FieldNode 1: Int32 name='age', length=2, nulls=0
   FieldNode 2: Float64 name='balance', length=2, nulls=0
   ```
   
   So now you can parse the buffers because you know the types:
   
   ```
   buffer 0: validity bitmap for name column (should be 1 byte)
   buffer 1: offsets for name column (should be the int32 values [0, 4, 10])
   buffer 2: data buffer for name column (should be 'jackJennie')
   buffer 3: validity bitmap for age column (should be 1 byte)
   buffer 4: values for age column (should be int32 values [12, 24], ie: 8 bytes)
   buffer 5: validity bitmap for balance column
   buffer 6: values for balance column (should be the bytes for [100.23, 2000.34], ie: 16 bytes)
   ```
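
To make the decoding step concrete, here is a small Python sketch that hand-builds a stand-in message body for this example and reads the columns back using only the standard library. Everything here is an assumption for illustration: the body is assembled by hand rather than taken from a real IPC message, `buffer_meta` plays the role of the `buffers` list (offset and length per buffer), and the helper names are made up. It assumes little-endian values and 8-byte padding between buffers.

```python
import struct


def pad8(data):
    """Pad a buffer with zero bytes up to a multiple of 8 bytes."""
    return data + b"\x00" * (-len(data) % 8)


# Hand-built buffers for the two-row example (all rows valid).
raw_buffers = [
    bytes([0b00000011]),                   # 0: name validity bitmap
    struct.pack("<3i", 0, 4, 10),          # 1: name offsets
    b"jackJennie",                         # 2: name data
    bytes([0b00000011]),                   # 3: age validity bitmap
    struct.pack("<2i", 12, 24),            # 4: age values
    bytes([0b00000011]),                   # 5: balance validity bitmap
    struct.pack("<2d", 100.23, 2000.34),   # 6: balance values
]

# Stand-in for the message body plus the (offset, length) list.
body = b""
buffer_meta = []
for buf in raw_buffers:
    buffer_meta.append((len(body), len(buf)))
    body += pad8(buf)


def get_buffer(i):
    """Slice buffer i out of the body using its offset and length."""
    offset, length = buffer_meta[i]
    return body[offset:offset + length]


def is_valid(bitmap, row):
    """Validity bitmaps are least-significant-bit first: bit `row` set means non-null."""
    return bool((bitmap[row // 8] >> (row % 8)) & 1)


length = 2  # from each FieldNode: length=2, nulls=0

offsets = struct.unpack(f"<{length + 1}i", get_buffer(1))
data = get_buffer(2)
names = [
    data[offsets[i]:offsets[i + 1]].decode("utf-8")
    if is_valid(get_buffer(0), i) else None
    for i in range(length)
]
ages = list(struct.unpack(f"<{length}i", get_buffer(4)))
balances = list(struct.unpack(f"<{length}d", get_buffer(6)))

print(names, ages, balances)
```

The offsets `[0, 4, 10]` are what split `'jackJennie'` back into the two strings; the fixed-width columns are just reinterpreted bytes.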
   
   Hope that helps

