nbauernfeind commented on a change in pull request #11646:
URL: https://github.com/apache/arrow/pull/11646#discussion_r745737341



##########
File path: format/Message.fbs
##########
@@ -117,6 +117,40 @@ table DictionaryBatch {
   isDelta: bool = false;
 }
 
+/// A range of field nodes, identified by their offset in the schema.
+/// The offsets are zero-indexed.
+struct FieldNodeRange {
+  /// The starting offset (inclusive)
+  start: long;
+
+  /// The ending offset (exclusive)
+  end: long;
+}
+
+/// A data header describing the shared memory layout of a "bag" of "columns".
+/// It is similar to a RecordBatch but not every top level FieldNode is required
+/// to be included in the wire payload.
+table ColumnBag {
+  /// If not provided, all field nodes are included and this payload is
+  /// identical to a RecordBatch. Otherwise the reader needs to skip
+  /// top level FieldNodes that were not included.
+  includedNodes: [FieldNodeRange];

Review comment:
   In RecordBatch, the FieldNodes are listed depth-first: the first field
node, then its children and grandchildren, followed by the second field node,
and so on. Note that if we include a top level field node we must also include
its children. This requirement certainly applies to array types, and I assume
it applies to nested structures as well -- but I have not used them enough to
be sure.
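   To make the ordering concrete, here is a minimal sketch of that depth-first
flattening (illustrative only -- the dict-based schema representation and
function names are invented for this example, not Arrow's actual API):

```python
# Hypothetical sketch: a RecordBatch lays out its FieldNodes depth-first,
# pre-order -- each top-level field is immediately followed by all of its
# descendants before the next top-level field starts.

def flatten_field_nodes(fields):
    """Yield (depth, name) pairs in the order FieldNodes appear on the wire."""
    for field in fields:
        yield from _walk(field, depth=0)

def _walk(field, depth):
    yield (depth, field["name"])
    for child in field.get("children", []):
        yield from _walk(child, depth + 1)

# A struct column "a" with two children, followed by a flat column "b":
schema = [
    {"name": "a", "children": [{"name": "a.x"}, {"name": "a.y"}]},
    {"name": "b"},
]
print(list(flatten_field_nodes(schema)))
# [(0, 'a'), (1, 'a.x'), (1, 'a.y'), (0, 'b')]
```

   This is why skipping a top-level node implies skipping its whole subtree:
the children occupy a contiguous run of FieldNode slots right after their
parent.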
   
   I think there are a few options to represent which top level nodes to
include:
   1) an encoded BitSet, but it is too easy to create degenerate cases
   2) each FieldNode could include a third parameter -- but in flatbuffers
this means the struct is written down differently (I think if a struct is
greater than 16B then it must be pre-written before constructing the
flatbuffer table that uses it)
   3) a parallel array alongside the field nodes indicating which field
offset each corresponds to, but its entries would be empty for child nodes
   4) what remains is a compromise: listing ranges of columns that were
included -- the use case I have in mind almost always has a single-digit
number of ranges, but the number of columns can easily be in the dozens or
more.
   
   > So to be clear, we can't do something like provide only a nested array - 
and implementations will need to validate that this only skips entire top level 
fields?
   
   I can't quite tell what solution you are proposing here. I think client 
implementations do end up working exactly like you are saying, though. Could 
you elaborate on your idea or defend an alternative approach? 
   
   -- 
   
   > Can we mark this as experimental like how Tensor does?
   
   Absolutely, patch update incoming.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
