Since the data is immutable, never having to recompute the null count has a lot of benefits -- I find that sum(isnull(x)) is a commonly examined statistic. I'm expecting we'll soon have hardware-accelerated popcount available in the implementations, so that even the null-aware code paths can cheaply skip swaths of all-null data.
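To make that concrete, here is a rough C++ sketch of the kind of null-aware kernel I have in mind. The Int32Array struct and its fields are placeholders for illustration (not settled API), and it assumes a little-endian machine with an LSB-first validity bitmap; a real implementation would also worry about alignment and SIMD.

#include <cstdint>
#include <cstring>

// Hypothetical array layout, for illustration only (not settled API).
struct Int32Array {
  const int32_t* values;       // length elements
  const uint8_t* null_bitmap;  // validity bits, 1 = valid; LSB-first; absent if no nulls
  int64_t length;
  int64_t null_count;
};

// Sum of the non-null values, skipping whole 64-value swaths of nulls.
int64_t SumNotNull(const Int32Array& arr) {
  int64_t sum = 0;

  // Fast path: null_count == 0 means we never look at the bitmap at all.
  if (arr.null_count == 0) {
    for (int64_t i = 0; i < arr.length; ++i) sum += arr.values[i];
    return sum;
  }

  // Null-aware path: inspect the validity bitmap one 64-bit word at a time.
  const int64_t full_words = arr.length / 64;
  for (int64_t w = 0; w < full_words; ++w) {
    uint64_t valid;
    std::memcpy(&valid, arr.null_bitmap + w * 8, 8);  // assumes little-endian
    const int set = __builtin_popcountll(valid);      // hardware popcount (GCC/Clang builtin)
    if (set == 0) continue;                           // 64 nulls in a row: skip the whole swath
    const int32_t* chunk = arr.values + w * 64;
    if (set == 64) {
      for (int i = 0; i < 64; ++i) sum += chunk[i];   // all valid: branch-free inner loop
    } else {
      for (int i = 0; i < 64; ++i)
        if (valid & (uint64_t(1) << i)) sum += chunk[i];
    }
  }

  // Ragged tail: values past the last full 64-bit word.
  for (int64_t i = full_words * 64; i < arr.length; ++i)
    if (arr.null_bitmap[i / 8] & (1u << (i % 8))) sum += arr.values[i];

  return sum;
}

The point being: the null_count == 0 branch never touches the bitmap, and in the null-aware path one popcount per word is enough to either skip 64 nulls outright or fall into the dense inner loop.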
On Sat, Feb 20, 2016 at 12:31 PM, Jacques Nadeau <jacq...@apache.org> wrote:

> I think part of it comes from the headache of all the types, etc. You get into some sort of permutation fatigue :)
>
> bool v. integer:
>
> Interesting question. Haven't thought a lot about it, but it seems like the cardinality of nulls could be a useful metric for deciding between algorithms. Not sure that it is more expensive to maintain.
>
> On Sat, Feb 20, 2016 at 12:12 PM, Daniel Robinson <danrobinson...@gmail.com> wrote:
>
>> I yield to your judgment from experience. Although I am surprised it wouldn't simplify the code in this case, since you will use the same algorithms on (and, if you never allocate a bitmask as Wes suggested, use the same data structures for) "nullable arrays" with null_count 0 as for non-nullable arrays. And formally I think it makes sense to have a nullable Array<T> have one of two types: some_nulls Array<T> or no_nulls Array<T>, much like its values have either type T or null.
>>
>> Is there a reason to make null_count an integer? Or could it just be a bool that keeps track of whether there are nulls or not?
>>
>> On Sat, Feb 20, 2016 at 3:03 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>>
>> > Makes sense
>> >
>> > On Sat, Feb 20, 2016 at 11:56 AM, Wes McKinney <w...@cloudera.com> wrote:
>> >
>> > > My expectation would be that data without nulls (as with required types) would typically not have the null bitmap allocated at all, but this would be implementation dependent. For example, in builder classes, the first time a null is appended, the null bitmap could be allocated.
>> > >
>> > > In an IPC / wire protocol context, there would be no reason to send extra bits when the null count is 0 -- the data receiver, based on their implementation, could decide whether or not to allocate a bitmap based on that information. Since the data structures are intended to be immutable, there is no specific need to create an all-0 bitmap.
>> > >
>> > > On Sat, Feb 20, 2016 at 11:52 AM, Jacques Nadeau <jacq...@apache.org> wrote:
>> > > > We actually started there (and in fact Drill has existed there for the last three years). However, more and more, other members of that team and I have come to the conclusion that the additional complexity isn't worth the extra level of code complication. By providing the null count we can achieve the same level of efficiency (+/- carrying around an extra bitmap, which is pretty nominal in the grand scheme of things).
>> > > >
>> > > > Another thought could be exposing nullability as a physical property and not having it be part of the logical model. That being said, I don't think it is worth the headache.
>> > > >
>> > > > On Sat, Feb 20, 2016 at 11:43 AM, Daniel Robinson <danrobinson...@gmail.com> wrote:
>> > > >
>> > > >> Hi all,
>> > > >>
>> > > >> I like this proposal (as well as the rest of the spec so far!). But why not go further and just store arrays that are nullable according to the schema but have no nulls in them as "non-nullable" data structures, i.e. structures that have no null bitmask? (After all, it would obviously be a waste to allocate a null bitmask for arrays with null_count = 0.) So there will be two types at the data structure level, and two implementations of every algorithm, one for each of those types.
>> > > >>
>> > > >> If you do that, I'm not sure I see a reason for keeping track of null_count. Is there ever an efficiency gain from having that stored with an array? Algorithms that might introduce or remove nulls could just keep track of their own "null_count" that increments up from 0, and create a no-nulls data structure if they never find one.
>> > > >>
>> > > >> I think this might also simplify the system interchange validation problem, since a system could just check the data-structure-level type of the input. (Although I'm not sure I understand why that would be necessary at "runtime.")
>> > > >>
>> > > >> Perhaps you should have different names for the data-structure-level types to distinguish them from the "nullable" and "non-nullable" types at the schema level. (And also for philosophical reasons: since the arrays are immutable, "nullable" doesn't really have meaning on that level, does it?) "some_null" and "no_null"? Maybe "sparse" and "dense," although that too has a different meaning elsewhere in the spec...
>> > > >>
>> > > >> On Sat, Feb 20, 2016 at 12:39 PM, Wes McKinney <w...@cloudera.com> wrote:
>> > > >>
>> > > >> > hi folks,
>> > > >> >
>> > > >> > welcome to all! It's great to see so many people excited about our plans to make data systems faster and more interoperable.
>> > > >> >
>> > > >> > In thinking about building some initial Arrow integrations, I've run into a couple of inter-related format questions.
>> > > >> >
>> > > >> > The first is a proposal to add a null count to Arrow arrays. With optional/nullable data, null_count == 0 will allow algorithms to skip the null-handling code paths and treat the data as required/non-nullable, yielding performance benefits. For example:
>> > > >> >
>> > > >> > if (arr->null_count() == 0) {
>> > > >> >   ...
>> > > >> > } else {
>> > > >> >   ...
>> > > >> > }
>> > > >> >
>> > > >> > Relatedly, at the data structure level, there is little semantic distinction between these two cases:
>> > > >> >
>> > > >> > - Required / non-nullable arrays
>> > > >> > - Optional arrays with null count 0
>> > > >> >
>> > > >> > My thoughts are that "required-ness" would best be handled at the metadata / schema level, rather than tasking the lowest tier of data structures and algorithms with handling the two semantically distinct, but functionally equivalent, forms of data without nulls. When performing analytics, it adds complexity, as some operations may introduce or remove nulls, which would require type metadata to be massaged, i.e.:
>> > > >> >
>> > > >> > function(required input) -> optional output, versus
>> > > >> >
>> > > >> > function(input [null_count == 0]) -> output [maybe null_count > 0].
>> > > >> >
>> > > >> > In the latter case, algorithms set bits and track the number of nulls while constructing the output Arrow array; the former adds some extra complexity.
>> > > >> >
>> > > >> > The question, of course, is where to enforce "required" in data interchange. If two systems have agreed (through exchange of schemas/metadata) that a particular batch of Arrow data is non-nullable, I would suggest that the null_count == 0 contract be validated at that point.
>> > > >> >
>> > > >> > Curious to hear others' thoughts on this, and please let me know if I can clarify anything I've said here.
>> > > >> >
>> > > >> > best,
>> > > >> > Wes
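Picking up on Wes's builder point above (allocating the bitmap only once the first null arrives), here is a minimal toy sketch; the Int32Builder name and methods are made up for illustration and are not a real builder API.

#include <cstdint>
#include <vector>

// Toy builder: the validity bitmap is only allocated when the first null is appended.
class Int32Builder {
 public:
  void Append(int32_t value) {
    if (!null_bitmap_.empty()) SetBit(values_.size(), true);
    values_.push_back(value);
  }

  void AppendNull() {
    if (null_bitmap_.empty()) {
      // First null seen: allocate the bitmap now and back-fill valid bits
      // for everything appended so far.
      null_bitmap_.assign((values_.size() + 1 + 7) / 8, 0);
      for (size_t i = 0; i < values_.size(); ++i) SetBit(i, true);
    }
    SetBit(values_.size(), false);
    values_.push_back(0);  // placeholder slot for the null
    ++null_count_;
  }

  int64_t null_count() const { return null_count_; }
  bool has_bitmap() const { return !null_bitmap_.empty(); }

 private:
  void SetBit(size_t i, bool valid) {
    if (i / 8 >= null_bitmap_.size()) null_bitmap_.resize(i / 8 + 1, 0);
    if (valid) null_bitmap_[i / 8] |= uint8_t(1) << (i % 8);
    else       null_bitmap_[i / 8] &= ~(uint8_t(1) << (i % 8));
  }

  std::vector<int32_t> values_;
  std::vector<uint8_t> null_bitmap_;  // empty until the first null is appended
  int64_t null_count_ = 0;
};

If no null is ever appended, the finished array carries null_count == 0 and no bitmap buffer at all, which matches the IPC point above: nothing extra to send on the wire.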
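And for the interchange question at the end of the original mail, enforcing the null_count == 0 contract at the receiving boundary could be as small as the following sketch; the Field struct here is hypothetical, standing in for whatever schema metadata the two systems exchanged.

#include <cstdint>
#include <stdexcept>

// Hypothetical schema-level field metadata agreed on by both systems.
struct Field {
  bool nullable;
};

// If the exchanged schema says the column is required, a received batch
// must arrive with null_count == 0.
void ValidateReceivedColumn(const Field& field, int64_t null_count) {
  if (!field.nullable && null_count != 0) {
    throw std::runtime_error("required column arrived with nulls");
  }
}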