Chao -- yes, I believe you're right.

In an IPC message, if the node is required or has null_count == 0,
then the null bitmap can be omitted from the payload; otherwise it is
present (this is similar to the Hive HS2 V6 protocol).

One alternative: rather than omitting the bitmap entirely, its entry
can be kept in the data header metadata with a buffer size of zero.
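
To make that concrete (sketch only, reusing the Buffer / ArrayNode
names from the proposal quoted below): a length-4 all-valid List node
could be described either way:

  omitted:    ArrayNode { length: 4, null_count: 0,
                          buffers: [offsets] }
  zero-size:  ArrayNode { length: 4, null_count: 0,
                          buffers: [{data: 0, length: 0}, offsets] }

The zero-size variant keeps each node type's buffer positions fixed,
at the cost of a couple of extra metadata entries.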

- Wes

On Sun, Mar 6, 2016 at 12:33 PM, Chao Sun <sunc...@apache.org> wrote:
> Hi,
>
> Sorry if this is a n00b question. For the example that Jacques used in
> the previous thread, how does it work if the struct is nullable -
> shouldn't there be an is_null array between 1 and 2?
>
> For instance, with the example:
>
> list elem #0: { <f0: aaa, f1: 42>, null, <f0: null, f1: 28> }
> list elem #1: { <f0: bbb, f1: null> }
> list elem #2: null
> list elem #3: { <f0: null, f1: null> }
>
> what does the encoding look like?
>
> I'm thinking (0 = false, 1 = true in the is_null arrays):
>
> 0 (list is_null): 0 0 1 0
> 1 (list offset): 0 3 4 4 5
> 2 (struct is_null): 0 1 0 0 0
> 3 (struct field f0 is_null): 0 1 0 1
> 4 (struct field f0 offset): 0 3 3 6 6
> 5 (struct field f0 value): a a a b b b
> 6 (struct field f1 is_null): 0 0 1 1
> 7 (struct field f1 value): 42 28
>
> Let me know if this is wrong. Thanks!
>
> Best,
> Chao
>
> On Tue, Mar 1, 2016 at 1:13 PM, Wes McKinney <w...@cloudera.com> wrote:
>
>> Inline responses
>>
>> On Tue, Mar 1, 2016 at 8:24 AM, Jacques Nadeau <jacq...@apache.org> wrote:
>> > Wes, thanks for starting this conversation.
>> >
>> > Couple thoughts:
>> >
>> > For metadata, we have two existing models (one in the ValueVectors
>> > approach and one in Parquet). It seems like we should start from one of
>> > those and then shape it as appropriate. We seem to have a richer physical
>> > capability than the core Dremel algorithm that Parquet implements, so I
>> > think it would make sense to focus first on the logical model and then
>> > figure out the shared physical representation that sits below it.
>> >
>> > While the Data Headers item (2) in your description may come logically
>> > second, I think it greatly informs 1.B, as I believe 2 should be an
>> > in-memory canonical representation (similar to the vectors themselves).
>> > I know Steven has been looking at moving the Java layer over to
>> > serializing the data headers using something similar to this:
>> >
>> > Data headers use a deterministic pre-order "tree" ordering of the memory
>> > buffers (https://en.wikipedia.org/wiki/Tree_traversal). The data
>> > structures are simply an array of data headers, each consisting of a
>> > list of buffer offsets and sizes.
>> >
>>
>> This makes sense, and is consistent with Parquet's depth-first
>> (pre-order) flattening of nested schemas into a flat list<SchemaElement>:
>> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L556
>>
>> > For example, consider this schema:
>> >
>> > List<Struct<String=List<UInt8>, Int32>>
>> >
>> > the pre-order buffer order is:
>> >
>> > 0: top-level list nulls
>> > 1: list offsets
>> > 2: struct field 0 nulls
>> > 3: struct field 0 list offsets
>> > 4: struct field 0 inner UInt8 values
>> > 5: struct field 1 nulls
>> > 6: struct field 1 Int32 values
>> >
>>
>> Regarding the buffers, we would want to embed the null count for each
>> logical array in some way, so that the null bitmap can be omitted. This
>> also spares the receiver from having to compute the actual null counts
>> when receiving a row batch.
>>
>> For example, for a List<T> we could have a couple of cases:
>>
>> List
>> - null_count > 0 -> 2 buffers: 1 for the null bitmap, 1 for the list offsets
>> - null_count == 0 -> 1 buffer for the list offsets
>>
>> Recursively, the child data for type T would similarly have its own
>> null count, determining its actual number of memory buffers. Let me
>> know what you think.
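>>
>> To spell out the rule, a rough sketch in C++ (illustrative only; the
>> node kinds and names are placeholders, not a committed API):
>>
>> // Number of memory buffers a node contributes. The validity bitmap
>> // buffer is dropped when null_count == 0 (or the field is required).
>> enum class NodeKind { kPrimitive, kList, kStruct };
>>
>> int NumBuffers(NodeKind kind, int null_count) {
>>   int n = (null_count > 0) ? 1 : 0;  // optional null bitmap buffer
>>   switch (kind) {
>>     case NodeKind::kList:      return n + 1;  // + list offsets buffer
>>     case NodeKind::kPrimitive: return n + 1;  // + values buffer
>>     case NodeKind::kStruct:    return n;      // children carry the data
>>   }
>>   return n;
>> }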
>>
>> > The flatbuffer schema for the data header would then be:
>> >
>> > namespace DataHeaders;
>> >
>> > struct Buffer {
>> >   data: long;
>> >   length: int;
>> > }
>> >
>> > // Representing a single array (aka ValueVector), typically
>> > table BufferList {
>> >   // With FBS it is not possible to know the length of an array
>> >   n_buffers: int;
>> >   buffers: [Buffer];
>> > }
>> >
>> > // Multiple arrays -- could be used for long arrays or a
>> > // whole table row batch
>> > table ArrayBatch {
>> >   n_arrays: int;
>> >   arrays: [BufferList];
>> > }
>>
>> I've been tinkering with flatbuffers, and array lengths are indeed
>> available in the generated code (see flatbuffers::Vector::size in the
>> C++ API, for example), so the n_buffers / n_arrays fields aren't
>> needed.
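>>
>> For instance, reading such a header from flatc-generated C++ might look
>> roughly like this (untested sketch; "data_headers_generated.h" and the
>> accessor names are just what flatc would emit for the schema above):
>>
>> #include <cstdio>
>> #include "flatbuffers/flatbuffers.h"
>> #include "data_headers_generated.h"  // hypothetical flatc output
>>
>> // Nothing is deserialized up front: the accessors read directly from
>> // the serialized bytes, and Vector::size() recovers the lengths.
>> void CountBuffers(const uint8_t* buf) {
>>   auto batch = flatbuffers::GetRoot<DataHeaders::ArrayBatch>(buf);
>>   for (const DataHeaders::BufferList* array : *batch->arrays()) {
>>     std::printf("array has %u buffers\n", array->buffers()->size());
>>   }
>> }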
>>
>> Per the discussion above re: null counts, something like this may work:
>>
>> struct Buffer {
>>   data: long;
>>   length: int;
>> }
>>
>> table ArrayNode {
>>   length: int;
>>   null_count: int;
>>
>>   /// The number of buffers would be inferred from the node type and actual
>>   /// null_count
>>   buffers: [Buffer];
>> }
>>
>> table RowBatch {
>>   /// Nodes correspond to the depth-first flattened logical schema
>>   nodes: [ArrayNode];
>> }
>>
>> So in an example schema:
>>
>> row batch schema {
>>   f0: List<List<UInt8>>,
>>   f1: Int32
>> }
>>
>> The flattened version of this schema contains 4 ArrayNodes, 3 for the
>> nested f0 column, and 1 for the flat f1 column.
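>>
>> For reference, the pre-order flattening I have in mind here (same
>> convention as the buffer ordering above):
>>
>> 0: f0 outer List node (null bitmap?, offsets)
>> 1: f0 inner List node (null bitmap?, offsets)
>> 2: f0 UInt8 values node (null bitmap?, values)
>> 3: f1 Int32 node (null bitmap?, values)
>>
>> where the bitmap buffer marked "?" is present only when that node's
>> null_count > 0.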
>>
>> - Wes
>>
>> >
>> >
>> > On Mon, Feb 29, 2016 at 6:13 PM, Wes McKinney <w...@cloudera.com> wrote:
>> >
>> >> hello all,
>> >>
>> >> I wanted to kick-start the process of coming up with a standardized /
>> >> canonical metadata specification that we can use for describing Arrow
>> >> data to be moved between systems. This breaks down into at least two
>> >> distinct kinds of metadata:
>> >>
>> >> 1) "Schemas": physical types, logical types, child array types, struct
>> >> field names, and so forth. Does not contain information about the size
>> >> of the actual physical data (which depends on the length of arrays and
>> >> the sizes of list/variable-length type dimensions).
>> >>
>> >> 2) "Data headers": a description of the shape of a physical chunk of
>> >> data associated with a particular schema. Array length, null count,
>> >> memory buffer offsets and sizes, etc. This is the information you need
>> >> to compute the right pointers into a shared memory region or IPC/RPC
>> >> buffer and reconstruct Arrow container classes.
>> >>
>> >> Since #2 will depend on some of the details of #1, I suggest we start
>> >> figuring out #1 first. As far as the type metadata is concerned, to
>> >> avoid excess bike shedding we should break that problem into:
>> >>
>> >> A) The general layout of the type metadata / schemas
>> >> B) The technology we use for representing the schemas (and data
>> >> headers) in an implementation-independent way for use in an IPC/RPC
>> >> setting (and even to "store" ephemeral data on disk)
>> >>
>> >> On Item B, from what I've seen with Parquet and similar file formats
>> >> with embedded metadata, and in the spirit of Arrow's
>> >> "deserialize-nothing" ethos, I suggest we explore no-deserialization
>> >> technologies like Google's Flatbuffers
>> >> (https://github.com/google/flatbuffers) as a more CPU-efficient
>> >> alternative to Thrift, Protobuf, or Avro. With large schemas,
>> >> technologies like Thrift can add significant overhead in
>> >> "needle-in-a-haystack" problems where you are picking only a few
>> >> columns out of very wide tables (thousands of columns), and it may be
>> >> best to avoid this if at all possible.
>> >>
>> >> I would like some help stewarding the design process on this from the
>> >> Arrow PMC, in particular those who have worked on the design and
>> >> implementation of Parquet and other file formats and systems for which
>> >> Arrow is an intended immediate companion. There is a lot we can learn
>> >> from those past experiences.
>> >>
>> >> Thank you,
>> >> Wes
>> >>
>>
