Yep, see https://github.com/apache/arrow/blob/master/format/Layout.md#null-bitmaps
"Arrays having a 0 null count may choose to not allocate the null bitmap." I do not know what the Java library will do in the event of 0 null count and 0-length validity bitmap -- in theory this should be accounted for already in the integration tests, but we might want to double check for our own sanity. In C++ we do not even bother with the validity bitmap when the null count is 0 https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.cc#L142 - Wes On Thu, Nov 9, 2017 at 12:09 PM, Brian Hulette <brian.hule...@ccri.com> wrote: > Ah! It didn't occur to me that a producer could just send a length-0 buffer > since the reader implementations should ignore it anyway. I don't mind the > 16 byte cost of the metadata - I was referring to the bloat of a 100% valid > vector, which could be substantial. > > Part of me wants to argue that the the dictionary indices are already > "special", since all other fields, including children, have their own > nullable attribute in the schema, while an index is specified by a lone > indexType. Treating the index more like a field makes it less special, in my > opinion. But that's just semantics, the ability to send length-0 buffers for > a 100% valid index accomplishes what I'm after. > > Brian > > > > On 11/09/2017 10:00 AM, Wes McKinney wrote: >>> >>> So I'll go after the other validity vector - maybe producers should be >>> allowed to omit the validity vector in the index? I just think if the goal >>> is to reduce bloat then redundant validity vectors seems like a logical >>> place to trim. >> >> Well, the cost of the additional buffer metadata is only 16 bytes -- >> on the wire I believe you are free to send a length-0 buffer if there >> are no nulls. I am not sure this is worth making the dictionary >> indices "special" during IPC reconstruction versus any other integer >> vector. >> >> The metadata bloat that we're trimming by removing the buffer layouts >> is more significant because the VectorLayout is a table, which has a >> larger footprint in Flatbuffers >> >> On Thu, Nov 9, 2017 at 9:20 AM, Brian Hulette <brian.hule...@ccri.com> >> wrote: >>> >>> Good point. Its a nice feature of the format that a dictionary batch and >>> a >>> record batch with a single column look exactly the same when they >>> represent >>> the same logical type. >>> >>> >>> So I'll go after the other validity vector - maybe producers should be >>> allowed to omit the validity vector in the index? I just think if the >>> goal >>> is to reduce bloat then redundant validity vectors seems like a logical >>> place to trim. >>> >>> Producers would need some way to communicate whether or not the index is >>> nullable. Right now there's only a single nullable flag in the Field >>> metadata >>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280), >>> which >>> determines whether or not both the index and the value vectors have a >>> validity vector. What if there were a second nullable flag in the >>> DictionaryEncoding table >>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L252) that >>> applies to the indexType? >>> >>> This idea does lead to one confusing edge case: a non-nullable dictionary >>> vector with a nullable index. Maybe that should be allowed though, that >>> would effectively represent the scheme that I was originally advocating >>> for >>> (validity vector in the index and not in the values). >>> >>> Brian >>> >>> >>> >>> On 11/08/2017 06:27 PM, Wes McKinney wrote: >>>> >>>> The dictionary batches simply wrap a record batch with one “column”. >>>> There >>>> should be no code difference (e.g. buffer layouts are the same) between >>>> the >>>> code handling the data in a dictionary and a normal record batches. In >>>> general, a dictionary may contain a null. >>>> >>>> On Wed, Nov 8, 2017 at 4:05 PM Brian Hulette <brian.hule...@ccri.com> >>>> wrote: >>>> >>>>> Agreed, that sounds like a great solution to this problem - the layout >>>>> information is redundant and it doesn't make sense to include it in >>>>> every schema. >>>>> >>>>> Although I would argue we should write down exactly what buffers are >>>>> supposed to go on the wire in the dictionary batches (i.e. value >>>>> vectors) as well. This should be largely the same as what goes on the >>>>> wire in a record batch for a non dictionary-encoded vector of the same >>>>> type, but there could be a difference. For example, if a dictionary >>>>> vector is nullable, do we really need a validity buffer in both the >>>>> index and in the values? I think that's the current behavior, but maybe >>>>> it would make sense to assert that a dictionary's value vector should >>>>> be >>>>> non-nullable, and nulls should be handled in the index vector? >>>>> >>>>> Brian >>>>> >>>>> >>>>> On 11/08/2017 05:24 PM, Wes McKinney wrote: >>>>>> >>>>>> Per Jacques' comment in ARROW-1693 >>>>>> >>>>> >>>>> https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244812#comment-16244812 >>>>> , >>>>>> >>>>>> I think we should remove the buffer layout from the metadata. It would >>>>>> be a good idea to do this for 0.8.0 since we're breaking the metadata >>>>>> anyway. >>>>>> >>>>>> In addition to bloating the size of the schemas on the wire, the >>>>>> buffer layout metadata provides redundant information which should be >>>>>> a strict part of the Arrow specification. I agree with Jacques that it >>>>>> would be better to write down exactly what buffers are supposed to go >>>>>> on the wire for each logical type. In the case of the dictionary >>>>>> vectors, it is the buffers for the indices, so the issue under >>>>>> discussion resolves itself if we nix the metadata. >>>>>> >>>>>> If writers are emitting possibly different buffer layouts (like >>>>>> omitting a null or zero-length buffer), it will introduce brittleness >>>>>> and cause much special casing to trickle down into the reader >>>>>> implementations. This seems like undue complexity. >>>>>> >>>>>> - Wes >>>>>> >>>>>> On Mon, Nov 6, 2017 at 9:33 AM, Brian Hulette <brian.hule...@ccri.com> >>>>> >>>>> wrote: >>>>>>> >>>>>>> We've been having some integration issues with reading Dictionary >>>>> >>>>> Vectors in >>>>>>> >>>>>>> the JS implementation - our current implementation can read arrow >>>>>>> files >>>>> >>>>> and >>>>>>> >>>>>>> streams generated by Java, but not by C++. Most of this discussion is >>>>>>> captured in ARROW-1693 [1]. >>>>>>> >>>>>>> It looks like ultimately the issue is that there are inconsistencies >>>>>>> in >>>>> >>>>> the >>>>>>> >>>>>>> way the various implementations handle buffer layouts for >>>>> >>>>> dictionary-encoded >>>>>>> >>>>>>> vectors in the Schema message. Some places write/read the buffer >>>>>>> layout >>>>> >>>>> for >>>>>>> >>>>>>> the value vector (the vector found in the dictionary batch), and >>>>>>> others >>>>>>> expect the layout for the index vector (the int vector found in the >>>>> >>>>> record >>>>>>> >>>>>>> batch). Both the Java and C++ IPC readers don't seem to care about >>>>>>> this >>>>>>> portion of the Schema, which explains why the integration tests are >>>>> >>>>> passing. >>>>>>> >>>>>>> Here's a fun ASCII table of how I think the Java/C++/JS IPC readers >>>>>>> and >>>>>>> writers handle those buffers layouts right now: >>>>>>> >>>>>>> | Writer | Reader >>>>>>> -----+--------------+------------- >>>>>>> Java | value vector | doesn't care >>>>>>> C++ | index vector | doesn't care >>>>>>> JS | N/A | value vector >>>>>>> >>>>>>> Note that I can only really speak with authority about the JS >>>>>>> implementation. I'd appreciate it if people more familiar with the >>>>> >>>>> other two >>>>>>> >>>>>>> could validate my claims. >>>>>>> >>>>>>> As far as I can tell the expected behavior isn't stated anywhere in >>>>>>> the >>>>>>> documentation, which I suppose explains the inconsistency. Paul >>>>>>> Taylor >>>>> >>>>> is >>>>>>> >>>>>>> currently working on resolving ARROW-1693 by making the JS reader >>>>> >>>>> ambivalent >>>>>>> >>>>>>> to buffer layout, but I think ultimately the correct solution is to >>>>> >>>>> agree on >>>>>>> >>>>>>> a consistent standard, and make the reader implementations >>>>>>> opinionated >>>>> >>>>> about >>>>>>> >>>>>>> the Schema buffer layouts (i.e. ARROW-1362 [2]). >>>>>>> >>>>>>> Personally, I don't really have an opinion either way about which >>>>> >>>>> vector's >>>>>>> >>>>>>> layout should be in the Schema. Either way we'll be missing some >>>>>>> layout >>>>>>> information though, so we should also consider where the information >>>>> >>>>> for the >>>>>>> >>>>>>> "other" vector might go. >>>>>>> >>>>>>> I know there's a release coming up, and now is probably not the time >>>>>>> to >>>>>>> tackle this problem, but I wanted to write it up while its fresh in >>>>>>> my >>>>> >>>>> mind. >>>>>>> >>>>>>> I'm fine shelving it until after 0.8. >>>>>>> >>>>>>> Brian >>>>>>> >>>>>>> [1] https://issues.apache.org/jira/browse/ARROW-1693 >>>>>>> [2] https://issues.apache.org/jira/browse/ARROW-1362 >>>>> >>>>> >