nbauernfeind commented on a change in pull request #11646: URL: https://github.com/apache/arrow/pull/11646#discussion_r745737341
##########
File path: format/Message.fbs
##########
@@ -117,6 +117,40 @@ table DictionaryBatch {
   isDelta: bool = false;
 }
 
+/// A range of field nodes, identified by their offset in the schema.
+/// The offsets are zero-indexed.
+struct FieldNodeRange {
+  /// The starting offset (inclusive)
+  start: long;
+
+  /// The ending offset (exclusive)
+  end: long;
+}
+
+/// A data header describing the shared memory layout of a "bag" of "columns".
+/// It is similar to a RecordBatch but not every top level FieldNode is required
+/// to be included in the wire payload.
+table ColumnBag {
+  /// If not provided, all field nodes are included and this payload is
+  /// identical to a RecordBatch. Otherwise the reader needs to skip
+  /// top level FieldNodes that were not included.
+  includedNodes: [FieldNodeRange];

Review comment:
In RecordBatch, the FieldNodes are listed in order: the first field node, then its children and grandchildren, followed by the second field node, and so on. Note that if we include a top level field node we must include its children. This requirement certainly applies to array-types, and I assume it applies to nested structures, but I have not used them enough to play with the idea.

I think there are a few options to represent which top level nodes to include:

1) An encoded BitSet, but it is too easy to create degenerate cases.
2) Each FieldNode could include a third parameter, but in flatbuffers this means that the struct is written down differently (I think if the struct is greater than 16B then it must be pre-written before constructing the flatbuffer table that uses it).
3) Include a parallel array, with one entry per field node, indicating which field offset it corresponds to, but this would be empty for child nodes.
4) What remains is a compromise: listing ranges of columns that were included. The use case I have in mind almost always has a single-digit number of ranges, but the number of columns can easily run into the tens of dozens.
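To make option 4 concrete, here is a minimal Python sketch of how a writer might collapse a set of included top level field offsets into half-open `[start, end)` ranges, and how a reader might test membership. The function names `to_ranges` and `is_included` are hypothetical and not part of this patch, and the sketch assumes the offsets index top level schema fields only.

```python
from typing import Iterable, List, Tuple

Range = Tuple[int, int]  # (start inclusive, end exclusive), zero-indexed


def to_ranges(included: Iterable[int]) -> List[Range]:
    """Collapse top level field offsets into sorted, non-overlapping ranges.

    Hypothetical helper: mirrors the FieldNodeRange struct in the diff,
    but is not taken from any Arrow implementation.
    """
    ranges: List[Range] = []
    for idx in sorted(set(included)):
        if ranges and ranges[-1][1] == idx:
            # idx is adjacent to the previous range, so extend it.
            ranges[-1] = (ranges[-1][0], idx + 1)
        else:
            # Start a new single-element range.
            ranges.append((idx, idx + 1))
    return ranges


def is_included(ranges: List[Range], field_index: int) -> bool:
    """Return True if the top level field falls inside any range."""
    return any(start <= field_index < end for start, end in ranges)


# Assumption for illustration: an empty list of ranges stands in for
# "includedNodes not provided", which the schema comment says means
# every field node is present. That case is handled by the caller here.
example = to_ranges([0, 1, 2, 7, 8, 20])
```

With a few contiguous runs of columns the encoding stays small no matter how many columns the schema has, which is the trade-off option 4 is aiming for.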
> So to be clear, we can't do something like provide only a nested array - and implementations will need to validate that this only skips entire top level fields?

I can't quite tell what solution you are proposing here. I think client implementations do end up working exactly like you are saying, though. Could you elaborate on your idea or defend an alternative approach?

--

> Can we mark this as experimental like how Tensor does?

Absolutely, patch update incoming.