zeroshade commented on issue #13875: URL: https://github.com/apache/arrow/issues/13875#issuecomment-1215324205
Hey @liusitan, the first message in an Arrow IPC stream should always be a `Schema` message, which describes the schema of the record batch messages that follow. So anything processing an IPC `RecordBatch` message should already have the expected schema for that record batch. Then you have the `nodes` field of the flatbuffer message, which is a flattened (depth-first, pre-order) version of the logical schema. So for example, if you had the following schema:

```
col1: Struct<a: Int32, b: List<item: Int64>, c: Float64>
col2: Utf8
```

That `nodes` list would be:

```
FieldNode 0: Struct name='col1'
FieldNode 1: Int32 name='a'
FieldNode 2: List name='b'
FieldNode 3: Int64 name='item'
FieldNode 4: Float64 name='c'
FieldNode 5: Utf8 name='col2'
```

Each of those `FieldNode` objects contains the metadata describing the length of the array and the number of nulls (the type and name shown here come from the schema; the `FieldNode` itself only stores the length and null count). The `RecordBatch` message also has a `buffers` list, which gives the offset and length of each buffer sent in the message body. Thus, for the example above, we'd expect the buffers to correspond to:

```
buffer 0: field 0 validity bitmap
buffer 1: field 1 validity bitmap
buffer 2: field 1 values buffer (i.e. a buffer of int32 values)
buffer 3: field 2 validity bitmap
buffer 4: field 2 offsets (i.e. the offsets buffer for the List array, a buffer of int32 values)
buffer 5: field 3 validity bitmap
buffer 6: field 3 values (i.e. a buffer of int64 values)
buffer 7: field 4 validity bitmap
buffer 8: field 4 values (i.e. a buffer of float64 values)
buffer 9: field 5 validity bitmap
buffer 10: field 5 offsets (i.e. the offsets for col2, the utf8 array, a buffer of int32 values)
buffer 11: field 5 data (i.e. the data buffer for the utf8 column, all of the string values)
```

For your provided example:

```
name     | age | balance
'jack'   | 12  | 100.23
'Jennie' | 24  | 2000.34
```

The first message in the stream would be the schema:

```
name: utf8
age: Int32
balance: Float64
```

The `nodes` in the `RecordBatch` message (the second message of the stream) would contain:

```
FieldNode 0: Utf8 name='name', length=2, nulls=0
FieldNode 1: Int32 name='age', length=2, nulls=0
FieldNode 2: Float64 name='balance', length=2, nulls=0
```

So now you can parse the buffers because you know the types:

```
buffer 0: validity bitmap for name column (should be 1 byte)
buffer 1: offsets for name column (should be the int32 values [0, 4, 10])
buffer 2: data buffer for name column (should be 'jackJennie')
buffer 3: validity bitmap for age column (should be 1 byte)
buffer 4: values for age column (should be int32 values [12, 24], i.e. 8 bytes)
buffer 5: validity bitmap for balance column
buffer 6: values for balance column (should be the bytes for [100.23, 2000.34], i.e. 16 bytes)
```

Hope that helps
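
If it helps to see the bookkeeping spelled out, here is a minimal Python sketch of the layout rules described above. It uses only the standard library rather than any Arrow API, so the tuple-based schema encoding and the names `SCHEMA`, `flatten`, `buffer_layout` and `decode_utf8` are all hypothetical and exist purely for illustration. It also assumes little-endian data and ignores the padding and alignment that a real message body may carry between buffers.

```python
import struct

# Hypothetical schema encoding for illustration: (name, type, children).
SCHEMA = [
    ("col1", "struct", [
        ("a", "int32", []),
        ("b", "list", [("item", "int64", [])]),
        ("c", "float64", []),
    ]),
    ("col2", "utf8", []),
]

# Buffers contributed by each kind of type, following the listing above.
BUFFERS_BY_TYPE = {
    "struct": ["validity"],
    "list": ["validity", "offsets"],
    "utf8": ["validity", "offsets", "data"],
}
PRIMITIVE_BUFFERS = ["validity", "values"]  # int32, int64, float64, ...


def flatten(fields):
    """Yield (name, type) depth-first, pre-order: the order of the `nodes` list."""
    for name, typ, children in fields:
        yield name, typ
        yield from flatten(children)


def buffer_layout(fields):
    """Return [(node_index, buffer_kind), ...] in the order of the `buffers` list."""
    layout = []
    for index, (_name, typ) in enumerate(flatten(fields)):
        for kind in BUFFERS_BY_TYPE.get(typ, PRIMITIVE_BUFFERS):
            layout.append((index, kind))
    return layout


def decode_utf8(offsets_buf, data_buf, length):
    """Decode a Utf8 column from its int32 offsets buffer and its data buffer."""
    offsets = struct.unpack(f"<{length + 1}i", offsets_buf[: 4 * (length + 1)])
    return [
        data_buf[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(length)
    ]


# Hand-built buffers for the name/age/balance example (little-endian assumed).
offsets_buf = struct.pack("<3i", 0, 4, 10)         # buffer 1
data_buf = b"jackJennie"                           # buffer 2
age_buf = struct.pack("<2i", 12, 24)               # buffer 4, 8 bytes
balance_buf = struct.pack("<2d", 100.23, 2000.34)  # buffer 6, 16 bytes

names = decode_utf8(offsets_buf, data_buf, length=2)
ages = list(struct.unpack("<2i", age_buf))
balances = list(struct.unpack("<2d", balance_buf))
```

The `length` passed to `decode_utf8` is the value you would read from the column's `FieldNode`. In a real reader, each buffer would be sliced out of the message body using the offset and length entries in the `buffers` list, not built by hand as it is here.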
