I yield to your judgment from experience, although I am surprised it wouldn't simplify the code in this case: you will use the same algorithms on "nullable arrays" with null_count 0 as on non-nullable arrays (and, if you never allocate a bitmask, as Wes suggested, the same data structures too). Formally, I think it makes sense for a nullable Array<T> to have one of two types, some_nulls Array<T> or no_nulls Array<T>, much as its values have either type T or null.
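To make the two-form idea concrete, here is a rough C++ sketch. Int32Array, has_nulls, IsNull and Sum are hypothetical names made up for illustration, not anything from the Arrow code. I am also assuming a bitmap in which a set bit marks a null (following the "all-0 bitmap" wording further down the thread), and that an empty bitmap stands for the no_nulls form.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration type; not the Arrow C++ API.
struct Int32Array {
  std::vector<int32_t> values;
  // Assumed layout: one bit per value, a set bit marks a null.
  // Left empty when the array holds no nulls (the "no_nulls" form).
  std::vector<uint8_t> null_bitmap;

  bool has_nulls() const { return !null_bitmap.empty(); }
};

// Returns true when bit i of the bitmap is set, i.e. slot i is null.
inline bool IsNull(const std::vector<uint8_t>& bitmap, std::size_t i) {
  return ((bitmap[i / 8] >> (i % 8)) & 1) != 0;
}

// Sums the non-null values, picking the code path from the array's form.
inline int64_t Sum(const Int32Array& arr) {
  int64_t total = 0;
  if (!arr.has_nulls()) {
    // no_nulls: the same loop a required/non-nullable array would use.
    for (int32_t v : arr.values) total += v;
  } else {
    // some_nulls: the bitmap must cover every slot before we consult it.
    assert(arr.null_bitmap.size() * 8 >= arr.values.size());
    for (std::size_t i = 0; i < arr.values.size(); ++i) {
      if (!IsNull(arr.null_bitmap, i)) total += arr.values[i];
    }
  }
  return total;
}
```

In this sketch the only thing the algorithm needs from the array is the yes/no answer from has_nulls(); whether an exact count is worth carrying as well is a separate question.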
Is there a reason to make null_count an integer? Or could it just be a bool that keeps track of whether there are nulls or not?

On Sat, Feb 20, 2016 at 3:03 PM, Jacques Nadeau <jacq...@apache.org> wrote:

> Makes sense
>
> On Sat, Feb 20, 2016 at 11:56 AM, Wes McKinney <w...@cloudera.com> wrote:
>
> > My expectation would be that data without nulls (as with required
> > types) would typically not have the null bitmap allocated at, but this
> > would be implementation dependent. For example, in builder classes,
> > the first time a null is appended, the null bitmap could be allocated.
> >
> > In an IPC / wire protocol context, there would be no reason to send
> > extra bits when the null count is 0 -- the data receiver, based on
> > their implementation, could decide whether or not to allocate a bitmap
> > based on that information. Since the data structures are intended as
> > immutable, there is no specific need (to create an all-0 bitmap).
> >
> > On Sat, Feb 20, 2016 at 11:52 AM, Jacques Nadeau <jacq...@apache.org>
> > wrote:
> > > We actually started there (and in fact Drill existed there for the
> > > last three years). However, more and more, me and other members of
> > > that team have come to the conclusion that the additional complexity
> > > isn't worth the extra level of code complication. By providing the
> > > null count we can achieve the same level of efficiency (+/- carrying
> > > around an extra bitmap which is pretty nominal in the grand scheme
> > > of things).
> > >
> > > Another thought could be exposing nullability as a physical property
> > > and not have be part of the logical model. That being said, I don't
> > > think it is worth the headache.
> > >
> > > On Sat, Feb 20, 2016 at 11:43 AM, Daniel Robinson
> > > <danrobinson...@gmail.com> wrote:
> > >
> > >> Hi all,
> > >>
> > >> I like this proposal (as well as the rest of the spec so far!). But
> > >> why not go further and just store arrays that are nullable according
> > >> to the schema but have no nulls in them as "non-nullable" data
> > >> structures -- i.e. structures that have no null bitmask? (After all,
> > >> it would obviously be a waste to allocate a null bitmask for arrays
> > >> with null_count = 0.) So there will be two types on the data
> > >> structure level, and two implementations of every algorithm, one for
> > >> each of those types.
> > >>
> > >> If you do that, I'm not sure I see a reason for keeping track of
> > >> null_count. Is there ever an efficiency gain from having that stored
> > >> with an array? Algorithms that might introduce or remove nulls could
> > >> just keep track of their own "null_count" that increments up from 0,
> > >> and create a no-nulls data structure if they never find one.
> > >>
> > >> I think this might also simplify the system interchange validation
> > >> problem, since a system could just check the data-structure-level
> > >> type of the input. (Although I'm not sure I understand why that
> > >> would be necessary at "runtime.")
> > >>
> > >> Perhaps you should have different names for the data-structure-level
> > >> types to distinguish them from the "nullable" and "non-nullable"
> > >> types at the schema level. (And also for philosophical reasons --
> > >> since the arrays are immutable, "nullable" doesn't really have
> > >> meaning on that level, does it?). "some_null" and "no_null"? Maybe
> > >> "sparse" and "dense," although that too has a different meaning
> > >> elsewhere in the spec...
> > >>
> > >> On Sat, Feb 20, 2016 at 12:39 PM, Wes McKinney <w...@cloudera.com>
> > >> wrote:
> > >>
> > >> > hi folks,
> > >> >
> > >> > welcome to all! It's great to see so many people excited about our
> > >> > plans to make data systems faster and more interoperable.
> > >> >
> > >> > In thinking about building some initial Arrow integrations, I've run
> > >> > into a couple of inter-related format questions.
> > >> >
> > >> > The first is a proposal to add a null count to Arrow arrays. With
> > >> > optional/nullable data, null_count == 0 will allow algorithms to skip
> > >> > the null-handling code paths and treat the data as
> > >> > required/non-nullable, yielding performance benefits. For example:
> > >> >
> > >> > if (arr->null_count() == 0) {
> > >> >   ...
> > >> > } else {
> > >> >   ...
> > >> > }
> > >> >
> > >> > Relatedly, at the data structure level, there is little semantic
> > >> > distinction between these two cases
> > >> >
> > >> > - Required / Non-nullable arrays
> > >> > - Optional arrays with null count 0
> > >> >
> > >> > My thoughts are that "required-ness" would best be minded at the
> > >> > metadata / schema level, rather than tasking the lowest tier of data
> > >> > structures and algorithms with handling the two semantically distinct,
> > >> > but functionally equivalent forms of data without nulls. When
> > >> > performing analytics, it adds complexity as some operations may
> > >> > introduce or remove nulls, which would require type metadata to be
> > >> > massaged i.e.:
> > >> >
> > >> > function(required input) -> optional output, versus
> > >> >
> > >> > function(input [null_count == 0]) -> output [maybe null_count > 0].
> > >> >
> > >> > In the latter case, algorithms set bits and track the number of nulls
> > >> > while constructing the output Arrow array; the former adds some extra
> > >> > complexity.
> > >> >
> > >> > The question of course, is where to enforce "required" in data
> > >> > interchange. If two systems have agreed (through exchange of
> > >> > schemas/metadata) that a particular batch of Arrow data is
> > >> > non-nullable, I would suggest that the null_count == 0 contract be
> > >> > validated at that point.
> > >> >
> > >> > Curious to hear others' thoughts on this, and please let me know if I
> > >> > can clarify anything I've said here.
> > >> >
> > >> > best,
> > >> > Wes
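
Coming back to Wes's point about builders in the quoted thread: the behaviour he describes (allocate the null bitmap the first time a null is appended, and count nulls while constructing the output) might look roughly like the sketch below. Int32Builder, BuiltInt32Array, AppendNull and Finish are hypothetical names for illustration only, not the Arrow builder API, and the set-bit-means-null convention is my assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type; not the Arrow C++ API.
struct BuiltInt32Array {
  std::vector<int32_t> values;
  std::vector<uint8_t> null_bitmap;  // stays empty when null_count == 0
  int64_t null_count;
};

// Hypothetical builder that allocates the null bitmap lazily.
class Int32Builder {
 public:
  void Append(int32_t v) {
    values_.push_back(v);
    // Once a bitmap exists, keep it long enough to cover every slot.
    if (!null_bitmap_.empty()) GrowBitmap();
  }

  void AppendNull() {
    values_.push_back(0);  // placeholder value for the null slot
    GrowBitmap();          // the first call is what allocates the bitmap
    const std::size_t i = values_.size() - 1;
    null_bitmap_[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    ++null_count_;
  }

  BuiltInt32Array Finish() const {
    // Invariant: a bitmap exists exactly when at least one null was appended.
    assert((null_count_ == 0) == null_bitmap_.empty());
    return BuiltInt32Array{values_, null_bitmap_, null_count_};
  }

 private:
  // Resizes to ceil(n / 8) bytes; new bytes are zero, meaning "not null".
  void GrowBitmap() { null_bitmap_.resize((values_.size() + 7) / 8, 0); }

  std::vector<int32_t> values_;
  std::vector<uint8_t> null_bitmap_;
  int64_t null_count_ = 0;
};
```

Here the integer count costs one increment per null, and a receiver that only wants the bool can still get it by testing null_count == 0.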