I think part of it comes from the headache of all the types, etc. You get into some sort of permutation fatigue :)
bool v. integer: Interesting question. I haven't thought a lot about it, but it seems like the cardinality of nulls could be a useful metric for deciding between algorithms. Not sure that it is more expensive to maintain.

On Sat, Feb 20, 2016 at 12:12 PM, Daniel Robinson <danrobinson...@gmail.com> wrote:

> I yield to your judgment from experience. Although I am surprised it wouldn't simplify the code in this case, since you will use the same algorithms on (and, if you never allocate a bitmask as Wes suggested, use the same data structures for) "nullable arrays" with null_count 0 as for non-nullable arrays. And formally I think it makes sense to have a nullable Array<T> have one of two types: some_nulls Array<T> or no_nulls Array<T>, much like its values have either type T or null.
>
> Is there a reason to make null_count an integer? Or could it just be a bool that keeps track of whether there are nulls or not?
>
> On Sat, Feb 20, 2016 at 3:03 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>
> > Makes sense
> >
> > On Sat, Feb 20, 2016 at 11:56 AM, Wes McKinney <w...@cloudera.com> wrote:
> >
> > > My expectation would be that data without nulls (as with required types) would typically not have the null bitmap allocated at all, but this would be implementation dependent. For example, in builder classes, the first time a null is appended, the null bitmap could be allocated.
> > >
> > > In an IPC / wire protocol context, there would be no reason to send extra bits when the null count is 0 -- the data receiver, based on their implementation, could decide whether or not to allocate a bitmap based on that information. Since the data structures are intended as immutable, there is no specific need to create an all-0 bitmap.
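To make the builder behavior Wes describes concrete, here is a rough sketch of lazy bitmap allocation. The class and member names are invented for illustration and are not Arrow's actual API:

#include <cstdint>
#include <vector>

// Hypothetical builder: the null bitmap is only materialized when the
// first null is appended, so all-valid arrays never allocate one.
class Int32Builder {
 public:
  void Append(int32_t value) {
    values_.push_back(value);
    // Only track validity once a bitmap exists.
    if (!null_bitmap_.empty()) null_bitmap_.push_back(true);
  }

  void AppendNull() {
    if (null_bitmap_.empty()) {
      // First null seen: retroactively mark every prior slot as valid.
      null_bitmap_.assign(values_.size(), true);
    }
    values_.push_back(0);  // placeholder slot for the null
    null_bitmap_.push_back(false);
    ++null_count_;
  }

  int64_t null_count() const { return null_count_; }

 private:
  std::vector<int32_t> values_;
  std::vector<bool> null_bitmap_;  // stand-in for a packed bit buffer
  int64_t null_count_ = 0;
};

If no null is ever appended, null_count() stays 0 and no bitmap memory is touched, which lines up with not sending validity bits on the wire when the null count is 0.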
> > > On Sat, Feb 20, 2016 at 11:52 AM, Jacques Nadeau <jacq...@apache.org> wrote:
> > >
> > > > We actually started there (and in fact Drill existed there for the last three years). However, more and more, other members of that team and I have come to the conclusion that the additional complexity isn't worth the extra level of code complication. By providing the null count we can achieve the same level of efficiency (+/- carrying around an extra bitmap, which is pretty nominal in the grand scheme of things).
> > > >
> > > > Another thought could be exposing nullability as a physical property and not have it be part of the logical model. That being said, I don't think it is worth the headache.
> > > >
> > > > On Sat, Feb 20, 2016 at 11:43 AM, Daniel Robinson <danrobinson...@gmail.com> wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> I like this proposal (as well as the rest of the spec so far!). But why not go further and just store arrays that are nullable according to the schema but have no nulls in them as "non-nullable" data structures—i.e. structures that have no null bitmask? (After all, it would obviously be a waste to allocate a null bitmask for arrays with null_count = 0.) So there would be two types at the data structure level, and two implementations of every algorithm, one for each of those types.
> > > >>
> > > >> If you do that, I'm not sure I see a reason for keeping track of null_count. Is there ever an efficiency gain from having that stored with an array? Algorithms that might introduce or remove nulls could just keep track of their own "null_count" that increments up from 0, and create a no-nulls data structure if they never find one.
> > > >>
> > > >> I think this might also simplify the system interchange validation problem, since a system could just check the data-structure-level type of the input. (Although I'm not sure I understand why that would be necessary at "runtime.")
> > > >>
> > > >> Perhaps you should have different names for the data-structure-level types to distinguish them from the "nullable" and "non-nullable" types at the schema level. (And also for philosophical reasons—since the arrays are immutable, "nullable" doesn't really have meaning on that level, does it?) "some_null" and "no_null"? Maybe "sparse" and "dense," although that too has a different meaning elsewhere in the spec...
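A rough sketch of what Daniel proposes, with invented names rather than any real API: the kernel counts nulls as it writes its output and only attaches a bitmap to the result if it actually produced any, yielding the "no_nulls" variant otherwise:

#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical container: a missing bitmap is the "no_nulls" variant.
struct Int32Array {
  std::vector<int32_t> values;
  std::unique_ptr<std::vector<bool>> null_bitmap;  // nullptr => no nulls
  int64_t null_count = 0;
};

// Example kernel that can introduce nulls: division, where dividing by
// zero yields null. Assumes lhs and rhs have equal length.
Int32Array SafeDivide(const std::vector<int32_t>& lhs,
                      const std::vector<int32_t>& rhs) {
  Int32Array out;
  out.values.resize(lhs.size(), 0);
  std::vector<bool> bitmap(lhs.size(), true);
  for (size_t i = 0; i < lhs.size(); ++i) {
    if (rhs[i] == 0) {
      bitmap[i] = false;  // introduce a null
      ++out.null_count;
    } else {
      out.values[i] = lhs[i] / rhs[i];
    }
  }
  // Only keep the bitmap if a null was ever produced.
  if (out.null_count > 0) {
    out.null_bitmap = std::make_unique<std::vector<bool>>(std::move(bitmap));
  }
  return out;
}

Either way the bitmap is filled in during the loop; the question in this thread is whether the result's "no nulls" state is conveyed by a distinct type or by null_count == 0 on a single type.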
> > > >> On Sat, Feb 20, 2016 at 12:39 PM, Wes McKinney <w...@cloudera.com> wrote:
> > > >>
> > > >> > hi folks,
> > > >> >
> > > >> > welcome to all! It's great to see so many people excited about our plans to make data systems faster and more interoperable.
> > > >> >
> > > >> > In thinking about building some initial Arrow integrations, I've run into a couple of inter-related format questions.
> > > >> >
> > > >> > The first is a proposal to add a null count to Arrow arrays. With optional/nullable data, null_count == 0 will allow algorithms to skip the null-handling code paths and treat the data as required/non-nullable, yielding performance benefits. For example:
> > > >> >
> > > >> > if (arr->null_count() == 0) {
> > > >> >   ...
> > > >> > } else {
> > > >> >   ...
> > > >> > }
> > > >> >
> > > >> > Relatedly, at the data structure level, there is little semantic distinction between these two cases:
> > > >> >
> > > >> > - Required / Non-nullable arrays
> > > >> > - Optional arrays with null count 0
> > > >> >
> > > >> > My thoughts are that "required-ness" would best be minded at the metadata / schema level, rather than tasking the lowest tier of data structures and algorithms with handling the two semantically distinct, but functionally equivalent, forms of data without nulls. When performing analytics, it adds complexity, as some operations may introduce or remove nulls, which would require type metadata to be massaged, i.e.:
> > > >> >
> > > >> > function(required input) -> optional output, versus
> > > >> >
> > > >> > function(input [null_count == 0]) -> output [maybe null_count > 0].
> > > >> >
> > > >> > In the latter case, algorithms set bits and track the number of nulls while constructing the output Arrow array; the former adds some extra complexity.
> > > >> >
> > > >> > The question, of course, is where to enforce "required" in data interchange. If two systems have agreed (through exchange of schemas/metadata) that a particular batch of Arrow data is non-nullable, I would suggest that the null_count == 0 contract be validated at that point.
> > > >> >
> > > >> > Curious to hear others' thoughts on this, and please let me know if I can clarify anything I've said here.
> > > >> >
> > > >> > best,
> > > >> > Wes
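To flesh out the if/else above: the benefit of storing null_count is that a kernel can dispatch to a tighter inner loop when it is 0. A minimal sketch, again with invented names rather than Arrow's actual API:

#include <cstdint>
#include <vector>

// Simplified container, analogous to the earlier sketches.
struct Int32Array {
  std::vector<int32_t> values;
  std::vector<bool> null_bitmap;  // ignored when null_count == 0
  int64_t null_count = 0;
};

int64_t Sum(const Int32Array& arr) {
  int64_t total = 0;
  if (arr.null_count == 0) {
    // Fast path: no per-element validity checks, trivially vectorizable.
    for (int32_t v : arr.values) total += v;
  } else {
    // General path: consult the validity bitmap for each slot.
    for (size_t i = 0; i < arr.values.size(); ++i) {
      if (arr.null_bitmap[i]) total += arr.values[i];
    }
  }
  return total;
}

Note this works the same whether the schema declared the field required or it merely happens to contain no nulls, which is exactly the point about handling "required-ness" at the schema level.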