A dictionary batch simply wraps a record batch with one “column”. There should be no difference (e.g. the buffer layouts are the same) between the code handling the data in a dictionary and the code handling a normal record batch. In general, a dictionary may contain a null.
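As a rough illustration of this point (a sketch, not code from any Arrow implementation), the snippet below decodes the single column of a dictionary batch and the index column of a record batch with the same helper. It assumes Arrow's validity bitmap convention (least-significant bit first, a set bit meaning non-null) and simplifies the values buffer to a plain Python list. The names is_valid, read_column and decode_dictionary are hypothetical.

```python
from typing import List, Optional, Sequence


def is_valid(validity: bytes, i: int) -> bool:
    """Return True if slot i is non-null (bitmap is LSB-first, 1 = valid)."""
    return bool((validity[i // 8] >> (i % 8)) & 1)


def read_column(validity: bytes, values: Sequence, length: int) -> List[Optional[object]]:
    """Decode one column from a (validity, values) pair.

    Simplification: `values` is a Python sequence standing in for the
    real data (and offset) buffers.
    """
    return [values[i] if is_valid(validity, i) else None for i in range(length)]


def decode_dictionary(indices: Sequence[Optional[int]],
                      dictionary: Sequence[Optional[object]]) -> List[Optional[object]]:
    """Look up each index; a null index or a null dictionary entry yields None."""
    return [None if ix is None else dictionary[ix] for ix in indices]


# Dictionary batch: one "column" with 3 slots, slot 1 is null.
dictionary = read_column(bytes([0b101]), ["a", "", "c"], 3)

# Record batch: the index column with 5 slots, slot 3 is null.
indices = read_column(bytes([0b10111]), [0, 1, 2, 0, 2], 5)

decoded = decode_dictionary(indices, dictionary)
print(decoded)
```

Here the dictionary's slot 1 and the index column's slot 3 are both null, so the decoded column carries nulls that originate from both places, and the same read_column path handles each.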

On Wed, Nov 8, 2017 at 4:05 PM Brian Hulette <brian.hule...@ccri.com> wrote:

> Agreed, that sounds like a great solution to this problem - the layout
> information is redundant and it doesn't make sense to include it in
> every schema.
>
> Although I would argue we should write down exactly what buffers are
> supposed to go on the wire in the dictionary batches (i.e. value
> vectors) as well. This should be largely the same as what goes on the
> wire in a record batch for a non dictionary-encoded vector of the same
> type, but there could be a difference. For example, if a dictionary
> vector is nullable, do we really need a validity buffer in both the
> index and in the values? I think that's the current behavior, but maybe
> it would make sense to assert that a dictionary's value vector should be
> non-nullable, and nulls should be handled in the index vector?
>
> Brian
>
>
> On 11/08/2017 05:24 PM, Wes McKinney wrote:
> > Per Jacques' comment in ARROW-1693
> > https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244812#comment-16244812,
> > I think we should remove the buffer layout from the metadata. It would
> > be a good idea to do this for 0.8.0 since we're breaking the metadata
> > anyway.
> >
> > In addition to bloating the size of the schemas on the wire, the
> > buffer layout metadata provides redundant information which should be
> > a strict part of the Arrow specification. I agree with Jacques that it
> > would be better to write down exactly what buffers are supposed to go
> > on the wire for each logical type. In the case of the dictionary
> > vectors, it is the buffers for the indices, so the issue under
> > discussion resolves itself if we nix the metadata.
> >
> > If writers are emitting possibly different buffer layouts (like
> > omitting a null or zero-length buffer), it will introduce brittleness
> > and cause much special casing to trickle down into the reader
> > implementations. This seems like undue complexity.
> >
> > - Wes
> >
> > On Mon, Nov 6, 2017 at 9:33 AM, Brian Hulette <brian.hule...@ccri.com> wrote:
> >> We've been having some integration issues with reading Dictionary
> >> Vectors in the JS implementation - our current implementation can
> >> read arrow files and streams generated by Java, but not by C++. Most
> >> of this discussion is captured in ARROW-1693 [1].
> >>
> >> It looks like ultimately the issue is that there are inconsistencies
> >> in the way the various implementations handle buffer layouts for
> >> dictionary-encoded vectors in the Schema message. Some places
> >> write/read the buffer layout for the value vector (the vector found
> >> in the dictionary batch), and others expect the layout for the index
> >> vector (the int vector found in the record batch). Both the Java and
> >> C++ IPC readers don't seem to care about this portion of the Schema,
> >> which explains why the integration tests are passing. Here's a fun
> >> ASCII table of how I think the Java/C++/JS IPC readers and writers
> >> handle those buffers layouts right now:
> >>
> >>      | Writer       | Reader
> >> -----+--------------+-------------
> >> Java | value vector | doesn't care
> >> C++  | index vector | doesn't care
> >> JS   | N/A          | value vector
> >>
> >> Note that I can only really speak with authority about the JS
> >> implementation. I'd appreciate it if people more familiar with the
> >> other two could validate my claims.
> >>
> >> As far as I can tell the expected behavior isn't stated anywhere in
> >> the documentation, which I suppose explains the inconsistency. Paul
> >> Taylor is currently working on resolving ARROW-1693 by making the JS
> >> reader ambivalent to buffer layout, but I think ultimately the
> >> correct solution is to agree on a consistent standard, and make the
> >> reader implementations opinionated about the Schema buffer layouts
> >> (i.e. ARROW-1362 [2]).
> >>
> >> Personally, I don't really have an opinion either way about which
> >> vector's layout should be in the Schema. Either way we'll be missing
> >> some layout information though, so we should also consider where the
> >> information for the "other" vector might go.
> >>
> >> I know there's a release coming up, and now is probably not the time
> >> to tackle this problem, but I wanted to write it up while its fresh
> >> in my mind. I'm fine shelving it until after 0.8.
> >>
> >> Brian
> >>
> >> [1] https://issues.apache.org/jira/browse/ARROW-1693
> >> [2] https://issues.apache.org/jira/browse/ARROW-1362