Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

Li Jin Thu, 09 Nov 2017 09:33:28 -0800

If null count is 0, the java library sets the validity vectors to all 1s.

https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java#L61


On Thu, Nov 9, 2017 at 12:23 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> Yep, see https://github.com/apache/arrow/blob/master/format/
> Layout.md#null-bitmaps
>
> "Arrays having a 0 null count may choose to not allocate the null bitmap."
>
> I do not know what the Java library will do in the event of 0 null
> count and 0-length validity bitmap -- in theory this should be
> accounted for already in the integration tests, but we might want to
> double check for our own sanity. In C++ we do not even bother with the
> validity bitmap when the null count is 0
>
> https://github.com/apache/arrow/blob/master/cpp/src/
> arrow/ipc/reader.cc#L142
>
> - Wes
>
> On Thu, Nov 9, 2017 at 12:09 PM, Brian Hulette <brian.hule...@ccri.com>
> wrote:
> > Ah! It didn't occur to me that a producer could just send a length-0
> buffer
> > since the reader implementations should ignore it anyway. I don't mind
> the
> > 16 byte cost of the metadata - I was referring to the bloat of a 100%
> valid
> > vector, which could be substantial.
> >
> > Part of me wants to argue that the the dictionary indices are already
> > "special", since all other fields, including children, have their own
> > nullable attribute in the schema, while an index is specified by a lone
> > indexType. Treating the index more like a field makes it less special,
> in my
> > opinion. But that's just semantics, the ability to send length-0 buffers
> for
> > a 100% valid index accomplishes what I'm after.
> >
> > Brian
> >
> >
> >
> > On 11/09/2017 10:00 AM, Wes McKinney wrote:
> >>>
> >>> So I'll go after the other validity vector - maybe producers should be
> >>> allowed to omit the validity vector in the index? I just think if the
> goal
> >>> is to reduce bloat then redundant validity vectors seems like a logical
> >>> place to trim.
> >>
> >> Well, the cost of the additional buffer metadata is only 16 bytes --
> >> on the wire I believe you are free to send a length-0 buffer if there
> >> are no nulls. I am not sure this is worth making the dictionary
> >> indices "special" during IPC reconstruction versus any other integer
> >> vector.
> >>
> >> The metadata bloat that we're trimming by removing the buffer layouts
> >> is more significant because the VectorLayout is a table, which has a
> >> larger footprint in Flatbuffers
> >>
> >> On Thu, Nov 9, 2017 at 9:20 AM, Brian Hulette <brian.hule...@ccri.com>
> >> wrote:
> >>>
> >>> Good point. Its a nice feature of the format that a dictionary batch
> and
> >>> a
> >>> record batch with a single column look exactly the same when they
> >>> represent
> >>> the same logical type.
> >>>
> >>>
> >>> So I'll go after the other validity vector - maybe producers should be
> >>> allowed to omit the validity vector in the index? I just think if the
> >>> goal
> >>> is to reduce bloat then redundant validity vectors seems like a logical
> >>> place to trim.
> >>>
> >>> Producers would need some way to communicate whether or not the index
> is
> >>> nullable. Right now there's only a single nullable flag in the Field
> >>> metadata
> >>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280),
> >>> which
> >>> determines whether or not both the index and the value vectors have a
> >>> validity vector. What if there were a second nullable flag in the
> >>> DictionaryEncoding table
> >>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L252)
> that
> >>> applies to the indexType?
> >>>
> >>> This idea does lead to one confusing edge case: a non-nullable
> dictionary
> >>> vector with a nullable index. Maybe that should be allowed though, that
> >>> would effectively represent the scheme that I was originally advocating
> >>> for
> >>> (validity vector in the index and not in the values).
> >>>
> >>> Brian
> >>>
> >>>
> >>>
> >>> On 11/08/2017 06:27 PM, Wes McKinney wrote:
> >>>>
> >>>> The dictionary batches simply wrap a record batch with one “column”.
> >>>> There
> >>>> should be no code difference (e.g. buffer layouts are the same)
> between
> >>>> the
> >>>> code handling the data in a dictionary and a normal record batches. In
> >>>> general, a dictionary may contain a null.
> >>>>
> >>>> On Wed, Nov 8, 2017 at 4:05 PM Brian Hulette <brian.hule...@ccri.com>
> >>>> wrote:
> >>>>
> >>>>> Agreed, that sounds like a great solution to this problem - the
> layout
> >>>>> information is redundant and it doesn't make sense to include it in
> >>>>> every schema.
> >>>>>
> >>>>> Although I would argue we should write down exactly what buffers are
> >>>>> supposed to go on the wire in the dictionary batches (i.e. value
> >>>>> vectors) as well. This should be largely the same as what goes on the
> >>>>> wire in a record batch for a non dictionary-encoded vector of the
> same
> >>>>> type, but there could be a difference. For example, if a dictionary
> >>>>> vector is nullable, do we really need a validity buffer in both the
> >>>>> index and in the values? I think that's the current behavior, but
> maybe
> >>>>> it would make sense to assert that a dictionary's value vector should
> >>>>> be
> >>>>> non-nullable, and nulls should be handled in the index vector?
> >>>>>
> >>>>> Brian
> >>>>>
> >>>>>
> >>>>> On 11/08/2017 05:24 PM, Wes McKinney wrote:
> >>>>>>
> >>>>>> Per Jacques' comment in ARROW-1693
> >>>>>>
> >>>>>
> >>>>> https://issues.apache.org/jira/browse/ARROW-1693?page=
> com.atlassian.jira.plugin.system.issuetabpanels:comment-
> tabpanel&focusedCommentId=16244812#comment-16244812
> >>>>> ,
> >>>>>>
> >>>>>> I think we should remove the buffer layout from the metadata. It
> would
> >>>>>> be a good idea to do this for 0.8.0 since we're breaking the
> metadata
> >>>>>> anyway.
> >>>>>>
> >>>>>> In addition to bloating the size of the schemas on the wire, the
> >>>>>> buffer layout metadata provides redundant information which should
> be
> >>>>>> a strict part of the Arrow specification. I agree with Jacques that
> it
> >>>>>> would be better to write down exactly what buffers are supposed to
> go
> >>>>>> on the wire for each logical type. In the case of the dictionary
> >>>>>> vectors, it is the buffers for the indices, so the issue under
> >>>>>> discussion resolves itself if we nix the metadata.
> >>>>>>
> >>>>>> If writers are emitting possibly different buffer layouts (like
> >>>>>> omitting a null or zero-length buffer), it will introduce
> brittleness
> >>>>>> and cause much special casing to trickle down into the reader
> >>>>>> implementations. This seems like undue complexity.
> >>>>>>
> >>>>>> - Wes
> >>>>>>
> >>>>>> On Mon, Nov 6, 2017 at 9:33 AM, Brian Hulette <
> brian.hule...@ccri.com>
> >>>>>
> >>>>> wrote:
> >>>>>>>
> >>>>>>> We've been having some integration issues with reading Dictionary
> >>>>>
> >>>>> Vectors in
> >>>>>>>
> >>>>>>> the JS implementation - our current implementation can read arrow
> >>>>>>> files
> >>>>>
> >>>>> and
> >>>>>>>
> >>>>>>> streams generated by Java, but not by C++. Most of this discussion
> is
> >>>>>>> captured in ARROW-1693 [1].
> >>>>>>>
> >>>>>>> It looks like ultimately the issue is that there are
> inconsistencies
> >>>>>>> in
> >>>>>
> >>>>> the
> >>>>>>>
> >>>>>>> way the various implementations handle buffer layouts for
> >>>>>
> >>>>> dictionary-encoded
> >>>>>>>
> >>>>>>> vectors in the Schema message. Some places write/read the buffer
> >>>>>>> layout
> >>>>>
> >>>>> for
> >>>>>>>
> >>>>>>> the value vector (the vector found in the dictionary batch), and
> >>>>>>> others
> >>>>>>> expect the layout for the index vector (the int vector found in the
> >>>>>
> >>>>> record
> >>>>>>>
> >>>>>>> batch). Both the Java and C++ IPC readers don't seem to care about
> >>>>>>> this
> >>>>>>> portion of the Schema, which explains why the integration tests are
> >>>>>
> >>>>> passing.
> >>>>>>>
> >>>>>>> Here's a fun ASCII table of how I think the Java/C++/JS IPC readers
> >>>>>>> and
> >>>>>>> writers handle those buffers layouts right now:
> >>>>>>>
> >>>>>>>         | Writer       | Reader
> >>>>>>> -----+--------------+-------------
> >>>>>>> Java | value vector | doesn't care
> >>>>>>> C++  | index vector | doesn't care
> >>>>>>> JS   | N/A          | value vector
> >>>>>>>
> >>>>>>> Note that I can only really speak with authority about the JS
> >>>>>>> implementation. I'd appreciate it if people more familiar with the
> >>>>>
> >>>>> other two
> >>>>>>>
> >>>>>>> could validate my claims.
> >>>>>>>
> >>>>>>> As far as I can tell the expected behavior isn't stated anywhere in
> >>>>>>> the
> >>>>>>> documentation, which I suppose explains the inconsistency. Paul
> >>>>>>> Taylor
> >>>>>
> >>>>> is
> >>>>>>>
> >>>>>>> currently working on resolving ARROW-1693 by making the JS reader
> >>>>>
> >>>>> ambivalent
> >>>>>>>
> >>>>>>> to buffer layout, but I think ultimately the correct solution is to
> >>>>>
> >>>>> agree on
> >>>>>>>
> >>>>>>> a consistent standard, and make the reader implementations
> >>>>>>> opinionated
> >>>>>
> >>>>> about
> >>>>>>>
> >>>>>>> the Schema buffer layouts (i.e. ARROW-1362 [2]).
> >>>>>>>
> >>>>>>> Personally, I don't really have an opinion either way about which
> >>>>>
> >>>>> vector's
> >>>>>>>
> >>>>>>> layout should be in the Schema. Either way we'll be missing some
> >>>>>>> layout
> >>>>>>> information though, so we should also consider where the
> information
> >>>>>
> >>>>> for the
> >>>>>>>
> >>>>>>> "other" vector might go.
> >>>>>>>
> >>>>>>> I know there's a release coming up, and now is probably not the
> time
> >>>>>>> to
> >>>>>>> tackle this problem, but I wanted to write it up while its fresh in
> >>>>>>> my
> >>>>>
> >>>>> mind.
> >>>>>>>
> >>>>>>> I'm fine shelving it until after 0.8.
> >>>>>>>
> >>>>>>> Brian
> >>>>>>>
> >>>>>>> [1] https://issues.apache.org/jira/browse/ARROW-1693
> >>>>>>> [2] https://issues.apache.org/jira/browse/ARROW-1362
> >>>>>
> >>>>>
> >
>

Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

Reply via email to