> So I'll go after the other validity vector - maybe producers should be > allowed to omit the validity vector in the index? I just think if the goal is > to reduce bloat then redundant validity vectors seems like a logical place to > trim.
Well, the cost of the additional buffer metadata is only 16 bytes -- on the wire I believe you are free to send a length-0 buffer if there are no nulls. I am not sure this is worth making the dictionary indices "special" during IPC reconstruction versus any other integer vector. The metadata bloat that we're trimming by removing the buffer layouts is more significant because the VectorLayout is a table, which has a larger footprint in Flatbuffers On Thu, Nov 9, 2017 at 9:20 AM, Brian Hulette <brian.hule...@ccri.com> wrote: > Good point. Its a nice feature of the format that a dictionary batch and a > record batch with a single column look exactly the same when they represent > the same logical type. > > > So I'll go after the other validity vector - maybe producers should be > allowed to omit the validity vector in the index? I just think if the goal > is to reduce bloat then redundant validity vectors seems like a logical > place to trim. > > Producers would need some way to communicate whether or not the index is > nullable. Right now there's only a single nullable flag in the Field > metadata > (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280), which > determines whether or not both the index and the value vectors have a > validity vector. What if there were a second nullable flag in the > DictionaryEncoding table > (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L252) that > applies to the indexType? > > This idea does lead to one confusing edge case: a non-nullable dictionary > vector with a nullable index. Maybe that should be allowed though, that > would effectively represent the scheme that I was originally advocating for > (validity vector in the index and not in the values). > > Brian > > > > On 11/08/2017 06:27 PM, Wes McKinney wrote: >> >> The dictionary batches simply wrap a record batch with one “column”. There >> should be no code difference (e.g. buffer layouts are the same) between >> the >> code handling the data in a dictionary and a normal record batches. In >> general, a dictionary may contain a null. >> >> On Wed, Nov 8, 2017 at 4:05 PM Brian Hulette <brian.hule...@ccri.com> >> wrote: >> >>> Agreed, that sounds like a great solution to this problem - the layout >>> information is redundant and it doesn't make sense to include it in >>> every schema. >>> >>> Although I would argue we should write down exactly what buffers are >>> supposed to go on the wire in the dictionary batches (i.e. value >>> vectors) as well. This should be largely the same as what goes on the >>> wire in a record batch for a non dictionary-encoded vector of the same >>> type, but there could be a difference. For example, if a dictionary >>> vector is nullable, do we really need a validity buffer in both the >>> index and in the values? I think that's the current behavior, but maybe >>> it would make sense to assert that a dictionary's value vector should be >>> non-nullable, and nulls should be handled in the index vector? >>> >>> Brian >>> >>> >>> On 11/08/2017 05:24 PM, Wes McKinney wrote: >>>> >>>> Per Jacques' comment in ARROW-1693 >>>> >>> >>> https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244812#comment-16244812 >>> , >>>> >>>> I think we should remove the buffer layout from the metadata. It would >>>> be a good idea to do this for 0.8.0 since we're breaking the metadata >>>> anyway. >>>> >>>> In addition to bloating the size of the schemas on the wire, the >>>> buffer layout metadata provides redundant information which should be >>>> a strict part of the Arrow specification. I agree with Jacques that it >>>> would be better to write down exactly what buffers are supposed to go >>>> on the wire for each logical type. In the case of the dictionary >>>> vectors, it is the buffers for the indices, so the issue under >>>> discussion resolves itself if we nix the metadata. >>>> >>>> If writers are emitting possibly different buffer layouts (like >>>> omitting a null or zero-length buffer), it will introduce brittleness >>>> and cause much special casing to trickle down into the reader >>>> implementations. This seems like undue complexity. >>>> >>>> - Wes >>>> >>>> On Mon, Nov 6, 2017 at 9:33 AM, Brian Hulette <brian.hule...@ccri.com> >>> >>> wrote: >>>>> >>>>> We've been having some integration issues with reading Dictionary >>> >>> Vectors in >>>>> >>>>> the JS implementation - our current implementation can read arrow files >>> >>> and >>>>> >>>>> streams generated by Java, but not by C++. Most of this discussion is >>>>> captured in ARROW-1693 [1]. >>>>> >>>>> It looks like ultimately the issue is that there are inconsistencies in >>> >>> the >>>>> >>>>> way the various implementations handle buffer layouts for >>> >>> dictionary-encoded >>>>> >>>>> vectors in the Schema message. Some places write/read the buffer layout >>> >>> for >>>>> >>>>> the value vector (the vector found in the dictionary batch), and others >>>>> expect the layout for the index vector (the int vector found in the >>> >>> record >>>>> >>>>> batch). Both the Java and C++ IPC readers don't seem to care about this >>>>> portion of the Schema, which explains why the integration tests are >>> >>> passing. >>>>> >>>>> Here's a fun ASCII table of how I think the Java/C++/JS IPC readers and >>>>> writers handle those buffers layouts right now: >>>>> >>>>> | Writer | Reader >>>>> -----+--------------+------------- >>>>> Java | value vector | doesn't care >>>>> C++ | index vector | doesn't care >>>>> JS | N/A | value vector >>>>> >>>>> Note that I can only really speak with authority about the JS >>>>> implementation. I'd appreciate it if people more familiar with the >>> >>> other two >>>>> >>>>> could validate my claims. >>>>> >>>>> As far as I can tell the expected behavior isn't stated anywhere in the >>>>> documentation, which I suppose explains the inconsistency. Paul Taylor >>> >>> is >>>>> >>>>> currently working on resolving ARROW-1693 by making the JS reader >>> >>> ambivalent >>>>> >>>>> to buffer layout, but I think ultimately the correct solution is to >>> >>> agree on >>>>> >>>>> a consistent standard, and make the reader implementations opinionated >>> >>> about >>>>> >>>>> the Schema buffer layouts (i.e. ARROW-1362 [2]). >>>>> >>>>> Personally, I don't really have an opinion either way about which >>> >>> vector's >>>>> >>>>> layout should be in the Schema. Either way we'll be missing some layout >>>>> information though, so we should also consider where the information >>> >>> for the >>>>> >>>>> "other" vector might go. >>>>> >>>>> I know there's a release coming up, and now is probably not the time to >>>>> tackle this problem, but I wanted to write it up while its fresh in my >>> >>> mind. >>>>> >>>>> I'm fine shelving it until after 0.8. >>>>> >>>>> Brian >>>>> >>>>> [1] https://issues.apache.org/jira/browse/ARROW-1693 >>>>> [2] https://issues.apache.org/jira/browse/ARROW-1362 >>> >>> >