Re: Format: storing null count + required/non-nullable types

Daniel Robinson Sat, 20 Feb 2016 12:18:06 -0800

I yield to your judgment from experience, although I am surprised it
wouldn't simplify the code in this case: you would use the same algorithms
on "nullable arrays" with null_count 0 as on non-nullable arrays (and, if
you never allocate a bitmask, as Wes suggested, the same data structures
too). Formally, I think it makes sense for a nullable Array<T> to have one
of two types, some_nulls Array<T> or no_nulls Array<T>, much as its values
have either type T or null.
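
To make the point concrete, here is a minimal C++ sketch, not the Arrow API.
Int32Array, Int32Builder and Sum are hypothetical names, and the convention
that a set bit marks a null is an assumption made only for this example. The
builder allocates the bitmask the first time a null is appended, and the
algorithm takes the same path for any array whose null_count is 0:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical, simplified array type (not the Arrow API).
struct Int32Array {
  std::vector<int32_t> values;
  std::vector<uint8_t> null_bitmap;  // stays empty while null_count == 0
  int64_t null_count = 0;            // assumption: a set bit marks a null
};

class Int32Builder {
 public:
  void Append(int32_t v) {
    arr_.values.push_back(v);
    // Only keep the bitmask in step if one has already been allocated.
    if (!arr_.null_bitmap.empty()) GrowBitmap();
  }

  void AppendNull() {
    arr_.values.push_back(0);  // placeholder slot for the null
    GrowBitmap();              // first null triggers the allocation
    const std::size_t i = arr_.values.size() - 1;
    arr_.null_bitmap[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    ++arr_.null_count;
  }

  Int32Array Finish() { return std::move(arr_); }

 private:
  // One bit per value, rounded up to whole bytes; new bytes are zeroed.
  void GrowBitmap() { arr_.null_bitmap.resize((arr_.values.size() + 7) / 8, 0); }

  Int32Array arr_;
};

// The null_count == 0 branch never touches the bitmask, so it serves both
// schema-level "required" data and nullable data that has no nulls.
int64_t Sum(const Int32Array& a) {
  int64_t total = 0;
  if (a.null_count == 0) {
    for (int32_t v : a.values) total += v;
  } else {
    for (std::size_t i = 0; i < a.values.size(); ++i) {
      const bool is_null = (a.null_bitmap[i / 8] >> (i % 8)) & 1;
      if (!is_null) total += a.values[i];
    }
  }
  return total;
}
```

In this sketch an array built without nulls carries no bitmask at all, which
is the sense in which the two cases share a data structure.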


Is there a reason to make null_count an integer? Or could it just be a bool
that keeps track of whether there are nulls or not?
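
For comparison, a small sketch of what each representation can answer
without scanning the bitmask. ArrayMeta, HasNulls and ValidCount are
hypothetical names used only for illustration. A bool is enough to pick the
fast path, while an integer also gives the number of non-null values in
constant time:

```cpp
#include <cstdint>

// Hypothetical per-array metadata (not the Arrow API).
struct ArrayMeta {
  int64_t length;
  int64_t null_count;
};

// Everything a bool flag would tell us can be derived from the count.
inline bool HasNulls(const ArrayMeta& m) { return m.null_count > 0; }

// Only available in O(1) when the count is stored; with a bool flag this
// would need a pass over the bitmask whenever nulls are present.
inline int64_t ValidCount(const ArrayMeta& m) { return m.length - m.null_count; }
```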


On Sat, Feb 20, 2016 at 3:03 PM, Jacques Nadeau <[email protected]> wrote:

> Makes sense
>
> On Sat, Feb 20, 2016 at 11:56 AM, Wes McKinney <[email protected]> wrote:
>
> > My expectation would be that data without nulls (as with required
> > types) would typically not have the null bitmap allocated at all, but this
> > would be implementation dependent. For example, in builder classes,
> > the first time a null is appended, the null bitmap could be allocated.
> >
> > In an IPC / wire protocol context, there would be no reason to send
> > extra bits when the null count is 0 -- the data receiver, based on
> > their implementation, could decide whether or not to allocate a bitmap
> > based on that information. Since the data structures are intended as
> > immutable, there is no specific need (to create an all-0 bitmap).
> >
> > On Sat, Feb 20, 2016 at 11:52 AM, Jacques Nadeau <[email protected]> wrote:
> > > We actually started there (and in fact Drill existed there for the last
> > > three years). However, more and more, other members of that team and I
> > > have come to the conclusion that it isn't worth the extra level of code
> > > complication. By providing the null count we can achieve the same level
> > > of efficiency (+/- carrying around an extra bitmap, which is pretty
> > > nominal in the grand scheme of things).
> > >
> > > Another thought could be exposing nullability as a physical property and
> > > not having it be part of the logical model. That being said, I don't
> > > think it is worth the headache.
> > >
> > > On Sat, Feb 20, 2016 at 11:43 AM, Daniel Robinson <[email protected]> wrote:
> > >
> > >> Hi all,
> > >>
> > >> I like this proposal (as well as the rest of the spec so far!). But why
> > >> not go further and just store arrays that are nullable according to the
> > >> schema but have no nulls in them as "non-nullable" data structures, i.e.
> > >> structures that have no null bitmask? (After all, it would obviously be
> > >> a waste to allocate a null bitmask for arrays with null_count = 0.) So
> > >> there will be two types on the data structure level, and two
> > >> implementations of every algorithm, one for each of those types.
> > >>
> > >> If you do that, I'm not sure I see a reason for keeping track of
> > >> null_count. Is there ever an efficiency gain from having that stored
> > >> with an array? Algorithms that might introduce or remove nulls could
> > >> just keep track of their own "null_count" that increments up from 0,
> > >> and create a no-nulls data structure if they never find one.
> > >>
> > >> I think this might also simplify the system interchange validation
> > >> problem, since a system could just check the data-structure-level type
> > >> of the input. (Although I'm not sure I understand why that would be
> > >> necessary at "runtime.")
> > >>
> > >> Perhaps you should have different names for the data-structure-level
> > >> types to distinguish them from the "nullable" and "non-nullable" types
> > >> at the schema level. (And also for philosophical reasons: since the
> > >> arrays are immutable, "nullable" doesn't really have meaning on that
> > >> level, does it?) "some_null" and "no_null"? Maybe "sparse" and "dense,"
> > >> although that too has a different meaning elsewhere in the spec...
> > >>
> > >>
> > >>
> > >> On Sat, Feb 20, 2016 at 12:39 PM, Wes McKinney <[email protected]> wrote:
> > >>
> > >> > hi folks,
> > >> >
> > >> > welcome to all! It's great to see so many people excited about our
> > >> > plans to make data systems faster and more interoperable.
> > >> >
> > >> > In thinking about building some initial Arrow integrations, I've run
> > >> > into a couple of inter-related format questions.
> > >> >
> > >> > The first is a proposal to add a null count to Arrow arrays. With
> > >> > optional/nullable data, null_count == 0 will allow algorithms to
> > >> > skip the null-handling code paths and treat the data as
> > >> > required/non-nullable, yielding performance benefits. For example:
> > >> >
> > >> > if (arr->null_count() == 0) {
> > >> >   ...
> > >> > } else {
> > >> >   ...
> > >> > }
> > >> >
> > >> > Relatedly, at the data structure level, there is little semantic
> > >> > distinction between these two cases
> > >> >
> > >> > - Required / Non-nullable arrays
> > >> > - Optional arrays with null count 0
> > >> >
> > >> > My thoughts are that "required-ness" would best be minded at the
> > >> > metadata / schema level, rather than tasking the lowest tier of data
> > >> > structures and algorithms with handling the two semantically
> > >> > distinct, but functionally equivalent forms of data without nulls.
> > >> > When performing analytics, it adds complexity as some operations may
> > >> > introduce or remove nulls, which would require type metadata to be
> > >> > massaged, i.e.:
> > >> >
> > >> > function(required input) -> optional output, versus
> > >> >
> > >> > function(input [null_count == 0]) -> output [maybe null_count > 0].
> > >> >
> > >> > In the latter case, algorithms set bits and track the number of
> > >> > nulls while constructing the output Arrow array; the former adds
> > >> > some extra complexity.
> > >> >
> > >> > The question of course, is where to enforce "required" in data
> > >> > interchange. If two systems have agreed (through exchange of
> > >> > schemas/metadata) that a particular batch of Arrow data is
> > >> > non-nullable, I would suggest that the null_count == 0 contract be
> > >> > validated at that point.
> > >> >
> > >> > Curious to hear others' thoughts on this, and please let me know if
> > >> > I can clarify anything I've said here.
> > >> >
> > >> > best,
> > >> > Wes
> > >> >
> > >>
> >
>
