Good point. It's a nice feature of the format that a dictionary batch and a record batch with a single column look exactly the same when they represent the same logical type.

So I'll go after the other validity vector instead - maybe producers should be allowed to omit the validity vector in the index? If the goal is to reduce bloat, then redundant validity vectors seem like a logical place to trim.

Producers would need some way to communicate whether the index is nullable. Right now there's only a single nullable flag in the Field metadata (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280), which determines whether both the index and the value vectors have a validity vector. What if there were a second nullable flag in the DictionaryEncoding table (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L252) that applies to the indexType?
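
A rough sketch of what that might look like in Schema.fbs - the first three fields are the existing DictionaryEncoding table (doc comments elided), and indexNullable is just a strawman name, not a settled proposal:

    table DictionaryEncoding {
      id: long;
      indexType: Int;
      isOrdered: bool;

      /// strawman addition: does the index vector carry its own validity
      /// buffer? The Field-level nullable flag would then apply only to
      /// the value vector.
      indexNullable: bool;
    }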

This idea does lead to one confusing edge case: a non-nullable dictionary vector with a nullable index. Maybe that should be allowed, though - it would effectively represent the scheme that I was originally advocating for (validity vector in the index and not in the values). For example, a dictionary of ["a", "b", "c"] with indices [0, null, 2] encodes ["a", null, "c"], with the validity bitmap attached only to the indices.

Brian


On 11/08/2017 06:27 PM, Wes McKinney wrote:
The dictionary batches simply wrap a record batch with one “column”. There should be no code difference (e.g. buffer layouts are the same) between the code handling the data in a dictionary and in a normal record batch. In general, a dictionary may contain a null.
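
For reference, the DictionaryBatch table in Message.fbs is (roughly) nothing more than an id attached to a wrapped record batch:

    table DictionaryBatch {
      /// the id referenced by a Field's DictionaryEncoding
      id: long;
      /// the dictionary's values, wrapped as a one-column record batch
      data: RecordBatch;
    }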

On Wed, Nov 8, 2017 at 4:05 PM Brian Hulette <brian.hule...@ccri.com> wrote:

Agreed, that sounds like a great solution to this problem - the layout
information is redundant and it doesn't make sense to include it in
every schema.

Although I would argue we should write down exactly what buffers are
supposed to go on the wire in the dictionary batches (i.e. value
vectors) as well. This should be largely the same as what goes on the
wire in a record batch for a non-dictionary-encoded vector of the same
type, but there could be a difference. For example, if a dictionary
vector is nullable, do we really need a validity buffer in both the
index and in the values? I think that's the current behavior, but maybe
it would make sense to assert that a dictionary's value vector should be
non-nullable, and nulls should be handled in the index vector?

Brian


On 11/08/2017 05:24 PM, Wes McKinney wrote:
Per Jacques' comment in ARROW-1693 (https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244812#comment-16244812), I think we should remove the buffer layout from the metadata. It would be a good idea to do this for 0.8.0 since we're breaking the metadata anyway.
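
For anyone following along, the metadata in question is (roughly) the layout vector attached to each Field in Schema.fbs - other Field members elided here:

    /// represents the physical layout of a buffer
    table VectorLayout {
      /// the width of a slot in the buffer (typically 1, 8, 16, 32 or 64)
      bit_width: short;
      /// the purpose of the buffer (OFFSET, DATA, VALIDITY or TYPE)
      type: VectorType;
    }

    table Field {
      ...
      /// layout of buffers produced for this type (as derived from the Type)
      layout: [ VectorLayout ];
    }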

In addition to bloating the size of the schemas on the wire, the buffer layout metadata duplicates information that should be a strict part of the Arrow specification. I agree with Jacques that it
would be better to write down exactly what buffers are supposed to go
on the wire for each logical type. In the case of the dictionary
vectors, it is the buffers for the indices, so the issue under
discussion resolves itself if we nix the metadata.

If writers are emitting possibly different buffer layouts (like
omitting a null or zero-length buffer), it will introduce brittleness
and cause much special casing to trickle down into the reader
implementations. This seems like undue complexity.

- Wes

On Mon, Nov 6, 2017 at 9:33 AM, Brian Hulette <brian.hule...@ccri.com> wrote:

We've been having some integration issues with reading Dictionary Vectors in the JS implementation - our current implementation can read arrow files and streams generated by Java, but not by C++. Most of this discussion is captured in ARROW-1693 [1].

It looks like ultimately the issue is that there are inconsistencies in the way the various implementations handle buffer layouts for dictionary-encoded vectors in the Schema message. Some places write/read the buffer layout for the value vector (the vector found in the dictionary batch), and others expect the layout for the index vector (the int vector found in the record batch). Neither the Java nor the C++ IPC reader seems to care about this portion of the Schema, which explains why the integration tests are passing. Here's a fun ASCII table of how I think the Java/C++/JS IPC readers and writers handle those buffer layouts right now:

     | Writer       | Reader
-----+--------------+-------------
Java | value vector | doesn't care
C++  | index vector | doesn't care
JS   | N/A          | value vector

Note that I can only really speak with authority about the JS implementation. I'd appreciate it if people more familiar with the other two could validate my claims.

As far as I can tell the expected behavior isn't stated anywhere in the documentation, which I suppose explains the inconsistency. Paul Taylor is currently working on resolving ARROW-1693 by making the JS reader agnostic to buffer layout, but I think ultimately the correct solution is to agree on a consistent standard and make the reader implementations opinionated about the Schema buffer layouts (i.e. ARROW-1362 [2]).

Personally, I don't really have an opinion either way about which vector's layout should be in the Schema. Either way we'll be missing some layout information though, so we should also consider where the information for the "other" vector might go.

I know there's a release coming up, and now is probably not the time to tackle this problem, but I wanted to write it up while it's fresh in my mind. I'm fine shelving it until after 0.8.

Brian

[1] https://issues.apache.org/jira/browse/ARROW-1693
[2] https://issues.apache.org/jira/browse/ARROW-1362

