Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

Wes McKinney Thu, 09 Nov 2017 09:24:54 -0800

Yep, see 
https://github.com/apache/arrow/blob/master/format/Layout.md#null-bitmaps


"Arrays having a 0 null count may choose to not allocate the null bitmap."

I do not know what the Java library will do in the event of 0 null
count and 0-length validity bitmap -- in theory this should be
accounted for already in the integration tests, but we might want to
double check for our own sanity. In C++ we do not even bother with the
validity bitmap when the null count is 0

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.cc#L142

- Wes
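
A minimal sketch of the reader-side behavior described above, written in
Python rather than the C++ of reader.cc. The function names and the
plain-list representation of the values are hypothetical and are not part
of any Arrow library. It assumes the convention from Layout.md that a set
bit means "valid" and that bits are numbered from the least significant
bit of each byte.

```python
from typing import List, Optional


def is_valid(validity: Optional[bytes], null_count: int, i: int) -> bool:
    """Return True if slot i is non-null."""
    if null_count == 0:
        # The bitmap may be absent or zero-length; it is never read.
        return True
    # Bit i lives in byte i // 8 at bit position i % 8 (LSB numbering).
    return bool(validity[i // 8] & (1 << (i % 8)))


def decode(values: List[int], validity: Optional[bytes],
           null_count: int) -> List[Optional[int]]:
    """Replace null slots with None; leave valid slots untouched."""
    return [v if is_valid(validity, null_count, i) else None
            for i, v in enumerate(values)]


# A fully valid vector sent with a length-0 validity buffer.
all_valid = decode([10, 20, 30], b"", 0)

# One null in slot 1: bits 0 and 2 are set, bit 1 is clear.
one_null = decode([10, 20, 30], bytes([0b00000101]), 1)
```

Checking the null count before touching the bitmap is what lets a producer
send a length-0 buffer for a 100% valid vector without any special casing
in the reader.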

On Thu, Nov 9, 2017 at 12:09 PM, Brian Hulette <[email protected]> wrote:
> Ah! It didn't occur to me that a producer could just send a length-0 buffer
> since the reader implementations should ignore it anyway. I don't mind the
> 16 byte cost of the metadata - I was referring to the bloat of a 100% valid
> vector, which could be substantial.
>
> Part of me wants to argue that the dictionary indices are already
> "special", since all other fields, including children, have their own
> nullable attribute in the schema, while an index is specified by a lone
> indexType. Treating the index more like a field makes it less special, in my
> opinion. But that's just semantics, the ability to send length-0 buffers for
> a 100% valid index accomplishes what I'm after.
>
> Brian
>
>
>
> On 11/09/2017 10:00 AM, Wes McKinney wrote:
>>>
>>> So I'll go after the other validity vector - maybe producers should be
>>> allowed to omit the validity vector in the index? I just think if the goal
>>> is to reduce bloat then redundant validity vectors seem like a logical
>>> place to trim.
>>
>> Well, the cost of the additional buffer metadata is only 16 bytes --
>> on the wire I believe you are free to send a length-0 buffer if there
>> are no nulls. I am not sure this is worth making the dictionary
>> indices "special" during IPC reconstruction versus any other integer
>> vector.
>>
>> The metadata bloat that we're trimming by removing the buffer layouts
>> is more significant because the VectorLayout is a table, which has a
>> larger footprint in Flatbuffers
>>
>> On Thu, Nov 9, 2017 at 9:20 AM, Brian Hulette <[email protected]>
>> wrote:
>>>
>>> Good point. It's a nice feature of the format that a dictionary batch and
>>> a record batch with a single column look exactly the same when they
>>> represent the same logical type.
>>>
>>>
>>> So I'll go after the other validity vector - maybe producers should be
>>> allowed to omit the validity vector in the index? I just think if the
>>> goal is to reduce bloat then redundant validity vectors seem like a
>>> logical place to trim.
>>>
>>> Producers would need some way to communicate whether or not the index is
>>> nullable. Right now there's only a single nullable flag in the Field
>>> metadata
>>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280),
>>> which determines whether or not both the index and the value vectors have
>>> a validity vector. What if there were a second nullable flag in the
>>> DictionaryEncoding table
>>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L252) that
>>> applies to the indexType?
>>>
>>> This idea does lead to one confusing edge case: a non-nullable dictionary
>>> vector with a nullable index. Maybe that should be allowed though, that
>>> would effectively represent the scheme that I was originally advocating
>>> for (validity vector in the index and not in the values).
>>>
>>> Brian
>>>
>>>
>>>
>>> On 11/08/2017 06:27 PM, Wes McKinney wrote:
>>>>
>>>> The dictionary batches simply wrap a record batch with one “column”.
>>>> There should be no code difference (e.g. buffer layouts are the same)
>>>> between the code handling the data in a dictionary and a normal record
>>>> batch. In general, a dictionary may contain a null.
>>>>
>>>> On Wed, Nov 8, 2017 at 4:05 PM Brian Hulette <[email protected]>
>>>> wrote:
>>>>
>>>>> Agreed, that sounds like a great solution to this problem - the layout
>>>>> information is redundant and it doesn't make sense to include it in
>>>>> every schema.
>>>>>
>>>>> Although I would argue we should write down exactly what buffers are
>>>>> supposed to go on the wire in the dictionary batches (i.e. value
>>>>> vectors) as well. This should be largely the same as what goes on the
>>>>> wire in a record batch for a non dictionary-encoded vector of the same
>>>>> type, but there could be a difference. For example, if a dictionary
>>>>> vector is nullable, do we really need a validity buffer in both the
>>>>> index and in the values? I think that's the current behavior, but maybe
>>>>> it would make sense to assert that a dictionary's value vector should
>>>>> be non-nullable, and nulls should be handled in the index vector?
>>>>>
>>>>> Brian
>>>>>
>>>>>
>>>>> On 11/08/2017 05:24 PM, Wes McKinney wrote:
>>>>>>
>>>>>> Per Jacques' comment in ARROW-1693
>>>>>> (https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244812#comment-16244812),
>>>>>> I think we should remove the buffer layout from the metadata. It would
>>>>>> be a good idea to do this for 0.8.0 since we're breaking the metadata
>>>>>> anyway.
>>>>>>
>>>>>> In addition to bloating the size of the schemas on the wire, the
>>>>>> buffer layout metadata provides redundant information which should be
>>>>>> a strict part of the Arrow specification. I agree with Jacques that it
>>>>>> would be better to write down exactly what buffers are supposed to go
>>>>>> on the wire for each logical type. In the case of the dictionary
>>>>>> vectors, it is the buffers for the indices, so the issue under
>>>>>> discussion resolves itself if we nix the metadata.
>>>>>>
>>>>>> If writers are emitting possibly different buffer layouts (like
>>>>>> omitting a null or zero-length buffer), it will introduce brittleness
>>>>>> and cause much special casing to trickle down into the reader
>>>>>> implementations. This seems like undue complexity.
>>>>>>
>>>>>> - Wes
>>>>>>
>>>>>> On Mon, Nov 6, 2017 at 9:33 AM, Brian Hulette <[email protected]>
>>>>>> wrote:
>>>>>>>
>>>>>>> We've been having some integration issues with reading Dictionary
>>>>>>> Vectors in the JS implementation - our current implementation can read
>>>>>>> arrow files and streams generated by Java, but not by C++. Most of this
>>>>>>> discussion is captured in ARROW-1693 [1].
>>>>>>>
>>>>>>> It looks like ultimately the issue is that there are inconsistencies in
>>>>>>> the way the various implementations handle buffer layouts for
>>>>>>> dictionary-encoded vectors in the Schema message. Some places write/read
>>>>>>> the buffer layout for the value vector (the vector found in the
>>>>>>> dictionary batch), and others expect the layout for the index vector
>>>>>>> (the int vector found in the record batch). Both the Java and C++ IPC
>>>>>>> readers don't seem to care about this portion of the Schema, which
>>>>>>> explains why the integration tests are passing.
>>>>>>>
>>>>>>> Here's a fun ASCII table of how I think the Java/C++/JS IPC readers and
>>>>>>> writers handle those buffer layouts right now:
>>>>>>>
>>>>>>>      | Writer       | Reader
>>>>>>> -----+--------------+-------------
>>>>>>> Java | value vector | doesn't care
>>>>>>> C++  | index vector | doesn't care
>>>>>>> JS   | N/A          | value vector
>>>>>>>
>>>>>>> Note that I can only really speak with authority about the JS
>>>>>>> implementation. I'd appreciate it if people more familiar with the
>>>>>>> other two could validate my claims.
>>>>>>>
>>>>>>> As far as I can tell the expected behavior isn't stated anywhere in the
>>>>>>> documentation, which I suppose explains the inconsistency. Paul Taylor
>>>>>>> is currently working on resolving ARROW-1693 by making the JS reader
>>>>>>> ambivalent to buffer layout, but I think ultimately the correct
>>>>>>> solution is to agree on a consistent standard, and make the reader
>>>>>>> implementations opinionated about the Schema buffer layouts (i.e.
>>>>>>> ARROW-1362 [2]).
>>>>>>>
>>>>>>> Personally, I don't really have an opinion either way about which
>>>>>>> vector's layout should be in the Schema. Either way we'll be missing
>>>>>>> some layout information though, so we should also consider where the
>>>>>>> information for the "other" vector might go.
>>>>>>>
>>>>>>> I know there's a release coming up, and now is probably not the time to
>>>>>>> tackle this problem, but I wanted to write it up while it's fresh in my
>>>>>>> mind. I'm fine shelving it until after 0.8.
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/ARROW-1693
>>>>>>> [2] https://issues.apache.org/jira/browse/ARROW-1362
>>>>>
>>>>>
>
