A dictionary batch simply wraps a record batch with one “column”. There
should be no difference (e.g. buffer layouts are the same) between the
code handling the data in a dictionary batch and the code handling a normal
record batch. In general, a dictionary may contain a null.
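
To make the null handling concrete, here is a rough sketch in plain Python
(not the Arrow API). The name decode_dictionary is made up for illustration,
and None stands in for a slot whose validity bit is unset. It assumes both
the index vector and the dictionary's value vector can carry nulls:

```python
from typing import List, Optional


def decode_dictionary(
    indices: List[Optional[int]],
    dictionary: List[Optional[str]],
) -> List[Optional[str]]:
    """Resolve dictionary-encoded indices against the dictionary values.

    Hypothetical helper: None models an unset validity bit, in either
    the index vector or the dictionary's value vector.
    """
    decoded: List[Optional[str]] = []
    for idx in indices:
        if idx is None:
            # Null recorded in the index vector's validity buffer.
            decoded.append(None)
        else:
            # The looked-up value may itself be null in the dictionary.
            decoded.append(dictionary[idx])
    return decoded


dictionary = ["foo", None, "bar"]   # value vector from the dictionary batch
indices = [0, 2, None, 1, 0]        # index vector from the record batch
decoded = decode_dictionary(indices, dictionary)
print(decoded)
```

In this sketch a null can surface either from the index vector or from the
dictionary itself, so a reader that treats both as ordinary nullable vectors
needs no special casing.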

On Wed, Nov 8, 2017 at 4:05 PM Brian Hulette <brian.hule...@ccri.com> wrote:

> Agreed, that sounds like a great solution to this problem - the layout
> information is redundant and it doesn't make sense to include it in
> every schema.
>
> Although I would argue we should write down exactly what buffers are
> supposed to go on the wire in the dictionary batches (i.e. value
> vectors) as well. This should be largely the same as what goes on the
> wire in a record batch for a non dictionary-encoded vector of the same
> type, but there could be a difference. For example, if a dictionary
> vector is nullable, do we really need a validity buffer in both the
> index and in the values? I think that's the current behavior, but maybe
> it would make sense to assert that a dictionary's value vector should be
> non-nullable, and nulls should be handled in the index vector?
>
> Brian
>
>
> On 11/08/2017 05:24 PM, Wes McKinney wrote:
> > Per Jacques' comment in ARROW-1693
> > (https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244812#comment-16244812),
> > I think we should remove the buffer layout from the metadata. It would
> > be a good idea to do this for 0.8.0 since we're breaking the metadata
> > anyway.
> >
> > In addition to bloating the size of the schemas on the wire, the
> > buffer layout metadata provides redundant information which should be
> > a strict part of the Arrow specification. I agree with Jacques that it
> > would be better to write down exactly what buffers are supposed to go
> > on the wire for each logical type. In the case of the dictionary
> > vectors, it is the buffers for the indices, so the issue under
> > discussion resolves itself if we nix the metadata.
> >
> > If writers are emitting possibly different buffer layouts (like
> > omitting a null or zero-length buffer), it will introduce brittleness
> > and cause much special casing to trickle down into the reader
> > implementations. This seems like undue complexity.
> >
> > - Wes
> >
> > On Mon, Nov 6, 2017 at 9:33 AM, Brian Hulette <brian.hule...@ccri.com> wrote:
> >> We've been having some integration issues with reading Dictionary
> >> Vectors in the JS implementation - our current implementation can read
> >> arrow files and streams generated by Java, but not by C++. Most of this
> >> discussion is captured in ARROW-1693 [1].
> >>
> >> It looks like ultimately the issue is that there are inconsistencies in
> >> the way the various implementations handle buffer layouts for
> >> dictionary-encoded vectors in the Schema message. Some places write/read
> >> the buffer layout for the value vector (the vector found in the
> >> dictionary batch), and others expect the layout for the index vector
> >> (the int vector found in the record batch). Neither the Java nor the C++
> >> IPC reader seems to care about this portion of the Schema, which
> >> explains why the integration tests are passing. Here's a fun ASCII table
> >> of how I think the Java/C++/JS IPC readers and writers handle those
> >> buffer layouts right now:
> >>
> >>      | Writer       | Reader
> >> -----+--------------+-------------
> >> Java | value vector | doesn't care
> >> C++  | index vector | doesn't care
> >> JS   | N/A          | value vector
> >>
> >> Note that I can only really speak with authority about the JS
> >> implementation. I'd appreciate it if people more familiar with the other
> >> two could validate my claims.
> >>
> >> As far as I can tell the expected behavior isn't stated anywhere in the
> >> documentation, which I suppose explains the inconsistency. Paul Taylor
> >> is currently working on resolving ARROW-1693 by making the JS reader
> >> indifferent to buffer layout, but I think ultimately the correct
> >> solution is to agree on a consistent standard, and make the reader
> >> implementations opinionated about the Schema buffer layouts (i.e.
> >> ARROW-1362 [2]).
> >>
> >> Personally, I don't really have an opinion either way about which
> >> vector's layout should be in the Schema. Either way we'll be missing
> >> some layout information though, so we should also consider where the
> >> information for the "other" vector might go.
> >>
> >> I know there's a release coming up, and now is probably not the time to
> >> tackle this problem, but I wanted to write it up while it's fresh in my
> >> mind. I'm fine shelving it until after 0.8.
> >>
> >> Brian
> >>
> >> [1] https://issues.apache.org/jira/browse/ARROW-1693
> >> [2] https://issues.apache.org/jira/browse/ARROW-1362
>
>
