We've been having some integration issues with reading Dictionary
Vectors in the JS implementation: it can currently read Arrow files and
streams generated by Java, but not those generated by C++. Most of this
discussion is captured in ARROW-1693 [1].

It looks like the underlying issue is that there are inconsistencies in
the way the various implementations handle buffer layouts for
dictionary-encoded vectors in the Schema message. Some places write/read
the buffer layout for the value vector (the vector found in the
dictionary batch), and others expect the layout for the index vector
(the int vector found in the record batch). Neither the Java nor the C++
IPC reader seems to care about this portion of the Schema, which
explains why the integration tests are passing. Here's a fun ASCII table
of how I think the Java/C++/JS IPC readers and writers handle those
buffer layouts right now:

     | Writer       | Reader
-----+--------------+-------------
Java | value vector | doesn't care
C++  | index vector | doesn't care
JS   | N/A          | value vector

Note that I can only really speak with authority about the JS
implementation. I'd appreciate it if people more familiar with the other
two could validate my claims.
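
To make the two candidates concrete, here is a rough TypeScript sketch
for a dictionary-encoded utf8 column with int32 indices. The
`BufferKind` and `BufferLayout` types and the `describeLayout` helper
are simplified stand-ins invented for illustration, not the actual
Arrow JS or flatbuffers API, and the buffer order and bit widths are
assumptions about what each layout would typically contain:

```typescript
// Hypothetical, simplified model of one buffer entry in a field's layout.
type BufferKind = "VALIDITY" | "OFFSET" | "DATA";

interface BufferLayout {
  kind: BufferKind;
  bitWidth: number;
}

// Candidate 1: layout of the index vector, i.e. the int32 indices
// that appear in each record batch (assumed: validity bitmap + data).
const indexLayout: BufferLayout[] = [
  { kind: "VALIDITY", bitWidth: 1 },
  { kind: "DATA", bitWidth: 32 },
];

// Candidate 2: layout of the value vector, i.e. the utf8 values that
// appear in the dictionary batch (assumed: validity + offsets + bytes).
const valueLayout: BufferLayout[] = [
  { kind: "VALIDITY", bitWidth: 1 },
  { kind: "OFFSET", bitWidth: 32 },
  { kind: "DATA", bitWidth: 8 },
];

// Render a layout as a short human-readable string.
function describeLayout(layout: BufferLayout[]): string {
  return layout.map((b) => `${b.kind}(${b.bitWidth})`).join(", ");
}

console.log("index vector:", describeLayout(indexLayout));
console.log("value vector:", describeLayout(valueLayout));
```

Under these assumptions the two layouts differ in both the number and
the kind of buffers, so a reader that trusts the Schema layout will
misread a file whose writer chose the other vector.
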
As far as I can tell, the expected behavior isn't stated anywhere in the
documentation, which I suppose explains the inconsistency. Paul Taylor
is currently working on resolving ARROW-1693 by making the JS reader
agnostic to buffer layout, but I think the correct long-term solution
is to agree on a consistent standard and make the reader
implementations opinionated about the Schema buffer layouts (i.e.
ARROW-1362 [2]).
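
As a sketch of what an "opinionated" reader could look like once a
standard is agreed on, the check below rejects any Schema layout that
differs from the agreed one. Everything here is hypothetical:
`BufferLayout`, `sameLayout`, and `checkLayout` are illustrative names
rather than existing Arrow APIs, and picking the index vector as the
agreed layout is only an assumption made for the example:

```typescript
// Hypothetical, simplified model of one buffer entry in a field's layout.
type BufferKind = "VALIDITY" | "OFFSET" | "DATA";

interface BufferLayout {
  kind: BufferKind;
  bitWidth: number;
}

// True when both layouts list the same buffers in the same order.
function sameLayout(a: BufferLayout[], b: BufferLayout[]): boolean {
  return (
    a.length === b.length &&
    a.every((buf, i) => buf.kind === b[i]?.kind && buf.bitWidth === b[i]?.bitWidth)
  );
}

// Opinionated check: throw if the Schema's layout is not the agreed one.
function checkLayout(fromSchema: BufferLayout[], agreed: BufferLayout[]): void {
  if (!sameLayout(fromSchema, agreed)) {
    throw new Error("Schema buffer layout does not match the agreed layout");
  }
}

// Assumption for this example only: the agreed layout is the index vector's.
const agreedLayout: BufferLayout[] = [
  { kind: "VALIDITY", bitWidth: 1 },
  { kind: "DATA", bitWidth: 32 },
];

// A writer that emitted the value vector's layout instead (utf8 values).
const valueVectorLayout: BufferLayout[] = [
  { kind: "VALIDITY", bitWidth: 1 },
  { kind: "OFFSET", bitWidth: 32 },
  { kind: "DATA", bitWidth: 8 },
];

checkLayout(agreedLayout, agreedLayout); // passes silently
try {
  checkLayout(valueVectorLayout, agreedLayout);
} catch (err) {
  console.log("rejected:", (err as Error).message);
}
```

A layout-agnostic reader would simply skip this check, which is roughly
the "doesn't care" behavior described for the Java and C++ readers.
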
Personally, I don't have a strong opinion about which vector's layout
should be in the Schema. Whichever we choose, the Schema will be missing
the layout information for the "other" vector, so we should also
consider where that information might go.

I know there's a release coming up, and now is probably not the time to
tackle this problem, but I wanted to write it up while it's fresh in my
mind. I'm fine shelving it until after 0.8.
Brian

[1] https://issues.apache.org/jira/browse/ARROW-1693
[2] https://issues.apache.org/jira/browse/ARROW-1362