If null count is 0, the java library sets the validity vectors to all 1s. https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java#L61
On Thu, Nov 9, 2017 at 12:23 PM, Wes McKinney <wesmck...@gmail.com> wrote: > Yep, see https://github.com/apache/arrow/blob/master/format/ > Layout.md#null-bitmaps > > "Arrays having a 0 null count may choose to not allocate the null bitmap." > > I do not know what the Java library will do in the event of 0 null > count and 0-length validity bitmap -- in theory this should be > accounted for already in the integration tests, but we might want to > double check for our own sanity. In C++ we do not even bother with the > validity bitmap when the null count is 0 > > https://github.com/apache/arrow/blob/master/cpp/src/ > arrow/ipc/reader.cc#L142 > > - Wes > > On Thu, Nov 9, 2017 at 12:09 PM, Brian Hulette <brian.hule...@ccri.com> > wrote: > > Ah! It didn't occur to me that a producer could just send a length-0 > buffer > > since the reader implementations should ignore it anyway. I don't mind > the > > 16 byte cost of the metadata - I was referring to the bloat of a 100% > valid > > vector, which could be substantial. > > > > Part of me wants to argue that the the dictionary indices are already > > "special", since all other fields, including children, have their own > > nullable attribute in the schema, while an index is specified by a lone > > indexType. Treating the index more like a field makes it less special, > in my > > opinion. But that's just semantics, the ability to send length-0 buffers > for > > a 100% valid index accomplishes what I'm after. > > > > Brian > > > > > > > > On 11/09/2017 10:00 AM, Wes McKinney wrote: > >>> > >>> So I'll go after the other validity vector - maybe producers should be > >>> allowed to omit the validity vector in the index? I just think if the > goal > >>> is to reduce bloat then redundant validity vectors seems like a logical > >>> place to trim. > >> > >> Well, the cost of the additional buffer metadata is only 16 bytes -- > >> on the wire I believe you are free to send a length-0 buffer if there > >> are no nulls. I am not sure this is worth making the dictionary > >> indices "special" during IPC reconstruction versus any other integer > >> vector. > >> > >> The metadata bloat that we're trimming by removing the buffer layouts > >> is more significant because the VectorLayout is a table, which has a > >> larger footprint in Flatbuffers > >> > >> On Thu, Nov 9, 2017 at 9:20 AM, Brian Hulette <brian.hule...@ccri.com> > >> wrote: > >>> > >>> Good point. Its a nice feature of the format that a dictionary batch > and > >>> a > >>> record batch with a single column look exactly the same when they > >>> represent > >>> the same logical type. > >>> > >>> > >>> So I'll go after the other validity vector - maybe producers should be > >>> allowed to omit the validity vector in the index? I just think if the > >>> goal > >>> is to reduce bloat then redundant validity vectors seems like a logical > >>> place to trim. > >>> > >>> Producers would need some way to communicate whether or not the index > is > >>> nullable. Right now there's only a single nullable flag in the Field > >>> metadata > >>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280), > >>> which > >>> determines whether or not both the index and the value vectors have a > >>> validity vector. What if there were a second nullable flag in the > >>> DictionaryEncoding table > >>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L252) > that > >>> applies to the indexType? > >>> > >>> This idea does lead to one confusing edge case: a non-nullable > dictionary > >>> vector with a nullable index. Maybe that should be allowed though, that > >>> would effectively represent the scheme that I was originally advocating > >>> for > >>> (validity vector in the index and not in the values). > >>> > >>> Brian > >>> > >>> > >>> > >>> On 11/08/2017 06:27 PM, Wes McKinney wrote: > >>>> > >>>> The dictionary batches simply wrap a record batch with one “column”. > >>>> There > >>>> should be no code difference (e.g. buffer layouts are the same) > between > >>>> the > >>>> code handling the data in a dictionary and a normal record batches. In > >>>> general, a dictionary may contain a null. > >>>> > >>>> On Wed, Nov 8, 2017 at 4:05 PM Brian Hulette <brian.hule...@ccri.com> > >>>> wrote: > >>>> > >>>>> Agreed, that sounds like a great solution to this problem - the > layout > >>>>> information is redundant and it doesn't make sense to include it in > >>>>> every schema. > >>>>> > >>>>> Although I would argue we should write down exactly what buffers are > >>>>> supposed to go on the wire in the dictionary batches (i.e. value > >>>>> vectors) as well. This should be largely the same as what goes on the > >>>>> wire in a record batch for a non dictionary-encoded vector of the > same > >>>>> type, but there could be a difference. For example, if a dictionary > >>>>> vector is nullable, do we really need a validity buffer in both the > >>>>> index and in the values? I think that's the current behavior, but > maybe > >>>>> it would make sense to assert that a dictionary's value vector should > >>>>> be > >>>>> non-nullable, and nulls should be handled in the index vector? > >>>>> > >>>>> Brian > >>>>> > >>>>> > >>>>> On 11/08/2017 05:24 PM, Wes McKinney wrote: > >>>>>> > >>>>>> Per Jacques' comment in ARROW-1693 > >>>>>> > >>>>> > >>>>> https://issues.apache.org/jira/browse/ARROW-1693?page= > com.atlassian.jira.plugin.system.issuetabpanels:comment- > tabpanel&focusedCommentId=16244812#comment-16244812 > >>>>> , > >>>>>> > >>>>>> I think we should remove the buffer layout from the metadata. It > would > >>>>>> be a good idea to do this for 0.8.0 since we're breaking the > metadata > >>>>>> anyway. > >>>>>> > >>>>>> In addition to bloating the size of the schemas on the wire, the > >>>>>> buffer layout metadata provides redundant information which should > be > >>>>>> a strict part of the Arrow specification. I agree with Jacques that > it > >>>>>> would be better to write down exactly what buffers are supposed to > go > >>>>>> on the wire for each logical type. In the case of the dictionary > >>>>>> vectors, it is the buffers for the indices, so the issue under > >>>>>> discussion resolves itself if we nix the metadata. > >>>>>> > >>>>>> If writers are emitting possibly different buffer layouts (like > >>>>>> omitting a null or zero-length buffer), it will introduce > brittleness > >>>>>> and cause much special casing to trickle down into the reader > >>>>>> implementations. This seems like undue complexity. > >>>>>> > >>>>>> - Wes > >>>>>> > >>>>>> On Mon, Nov 6, 2017 at 9:33 AM, Brian Hulette < > brian.hule...@ccri.com> > >>>>> > >>>>> wrote: > >>>>>>> > >>>>>>> We've been having some integration issues with reading Dictionary > >>>>> > >>>>> Vectors in > >>>>>>> > >>>>>>> the JS implementation - our current implementation can read arrow > >>>>>>> files > >>>>> > >>>>> and > >>>>>>> > >>>>>>> streams generated by Java, but not by C++. Most of this discussion > is > >>>>>>> captured in ARROW-1693 [1]. > >>>>>>> > >>>>>>> It looks like ultimately the issue is that there are > inconsistencies > >>>>>>> in > >>>>> > >>>>> the > >>>>>>> > >>>>>>> way the various implementations handle buffer layouts for > >>>>> > >>>>> dictionary-encoded > >>>>>>> > >>>>>>> vectors in the Schema message. Some places write/read the buffer > >>>>>>> layout > >>>>> > >>>>> for > >>>>>>> > >>>>>>> the value vector (the vector found in the dictionary batch), and > >>>>>>> others > >>>>>>> expect the layout for the index vector (the int vector found in the > >>>>> > >>>>> record > >>>>>>> > >>>>>>> batch). Both the Java and C++ IPC readers don't seem to care about > >>>>>>> this > >>>>>>> portion of the Schema, which explains why the integration tests are > >>>>> > >>>>> passing. > >>>>>>> > >>>>>>> Here's a fun ASCII table of how I think the Java/C++/JS IPC readers > >>>>>>> and > >>>>>>> writers handle those buffers layouts right now: > >>>>>>> > >>>>>>> | Writer | Reader > >>>>>>> -----+--------------+------------- > >>>>>>> Java | value vector | doesn't care > >>>>>>> C++ | index vector | doesn't care > >>>>>>> JS | N/A | value vector > >>>>>>> > >>>>>>> Note that I can only really speak with authority about the JS > >>>>>>> implementation. I'd appreciate it if people more familiar with the > >>>>> > >>>>> other two > >>>>>>> > >>>>>>> could validate my claims. > >>>>>>> > >>>>>>> As far as I can tell the expected behavior isn't stated anywhere in > >>>>>>> the > >>>>>>> documentation, which I suppose explains the inconsistency. Paul > >>>>>>> Taylor > >>>>> > >>>>> is > >>>>>>> > >>>>>>> currently working on resolving ARROW-1693 by making the JS reader > >>>>> > >>>>> ambivalent > >>>>>>> > >>>>>>> to buffer layout, but I think ultimately the correct solution is to > >>>>> > >>>>> agree on > >>>>>>> > >>>>>>> a consistent standard, and make the reader implementations > >>>>>>> opinionated > >>>>> > >>>>> about > >>>>>>> > >>>>>>> the Schema buffer layouts (i.e. ARROW-1362 [2]). > >>>>>>> > >>>>>>> Personally, I don't really have an opinion either way about which > >>>>> > >>>>> vector's > >>>>>>> > >>>>>>> layout should be in the Schema. Either way we'll be missing some > >>>>>>> layout > >>>>>>> information though, so we should also consider where the > information > >>>>> > >>>>> for the > >>>>>>> > >>>>>>> "other" vector might go. > >>>>>>> > >>>>>>> I know there's a release coming up, and now is probably not the > time > >>>>>>> to > >>>>>>> tackle this problem, but I wanted to write it up while its fresh in > >>>>>>> my > >>>>> > >>>>> mind. > >>>>>>> > >>>>>>> I'm fine shelving it until after 0.8. > >>>>>>> > >>>>>>> Brian > >>>>>>> > >>>>>>> [1] https://issues.apache.org/jira/browse/ARROW-1693 > >>>>>>> [2] https://issues.apache.org/jira/browse/ARROW-1362 > >>>>> > >>>>> > > >