Hi all,

I like this proposal (as well as the rest of the spec so far!). But why
not go further and store arrays that are nullable according to the schema
but contain no nulls as "non-nullable" data structures, i.e. structures
with no null bitmask at all? (After all, it would clearly be a waste to
allocate a null bitmask for an array with null_count == 0.) There would
then be two types at the data structure level, and two implementations of
every algorithm, one for each of those types.
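
To make that concrete, here is a rough C++ sketch of the kind of split I
have in mind. The names and layouts (NoNullInt32Array, SomeNullInt32Array,
Sum) are made up for illustration, not taken from the spec:

// Purely illustrative: two data-structure-level array types, distinguished
// by whether a validity (null) bitmask is present at all.
#include <cstddef>
#include <cstdint>
#include <vector>

// No bitmask: every slot is valid by construction.
struct NoNullInt32Array {
  std::vector<int32_t> values;
};

// Carries a bitmask: bit i == 0 means slot i is null.
struct SomeNullInt32Array {
  std::vector<int32_t> values;
  std::vector<uint8_t> validity;  // one bit per slot
};

// Every algorithm then has one implementation per structure; the no-null
// version never touches a bitmask. E.g. a sum kernel:
int64_t Sum(const NoNullInt32Array& arr) {
  int64_t total = 0;
  for (int32_t v : arr.values) total += v;
  return total;
}

int64_t Sum(const SomeNullInt32Array& arr) {
  int64_t total = 0;
  for (size_t i = 0; i < arr.values.size(); ++i) {
    if (arr.validity[i / 8] & (1 << (i % 8))) total += arr.values[i];
  }
  return total;
}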

If you do that, I'm not sure I see a reason for keeping track of
null_count. Is there ever an efficiency gain from having that stored with
an array? Algorithms that might introduce or remove nulls could just keep
track of their own "null_count", counting up from 0, and produce the
no-nulls data structure if they never encounter a null.
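
For instance (again just a sketch with invented names, not a proposed
API), an output builder could allocate the bitmask lazily and decide which
structure to produce at the end:

// Illustrative builder: tracks its own null count while producing output
// and only allocates a validity bitmask once it actually sees a null.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Int32OutputBuilder {
  std::vector<int32_t> values;
  std::vector<uint8_t> validity;  // stays empty until the first null
  int64_t null_count = 0;

  void Append(int32_t v) {
    if (null_count > 0) SetValid(values.size(), true);
    values.push_back(v);
  }

  void AppendNull() {
    SetValid(values.size(), false);  // allocates/back-fills the bitmask
    values.push_back(0);             // placeholder slot
    ++null_count;
  }

  // Finish() would build the bitmask-free structure when null_count is
  // still 0, and the bitmask-carrying one otherwise.

 private:
  void SetValid(size_t i, bool valid) {
    if (validity.size() * 8 <= i) {
      validity.resize(i / 8 + 1, 0xFF);  // new bytes start as "all valid"
    }
    if (valid) {
      validity[i / 8] |= static_cast<uint8_t>(1 << (i % 8));
    } else {
      validity[i / 8] &= static_cast<uint8_t>(~(1 << (i % 8)));
    }
  }
};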

I think this might also simplify the system interchange validation problem,
since a system could just check the data-structure-level type of the input.
(Although I'm not sure I understand why that would be necessary at
"runtime.")

Perhaps you should have different names for the data-structure-level types
to distinguish them from the "nullable" and "non-nullable" types at the
schema level. (And also for philosophical reasons: since the arrays are
immutable, "nullable" doesn't really have meaning at that level, does it?)
"some_null" and "no_null"?  Maybe "sparse" and "dense," although that too
has a different meaning elsewhere in the spec...



On Sat, Feb 20, 2016 at 12:39 PM, Wes McKinney <w...@cloudera.com> wrote:

> hi folks,
>
> welcome to all! It's great to see so many people excited about our
> plans to make data systems faster and more interoperable.
>
> In thinking about building some initial Arrow integrations, I've run
> into a couple of inter-related format questions.
>
> The first is a proposal to add a null count to Arrow arrays. With
> optional/nullable data, null_count == 0 will allow algorithms to skip
> the null-handling code paths and treat the data as
> required/non-nullable, yielding performance benefits. For example:
>
> if (arr->null_count() == 0) {
>   ...
> } else {
>   ...
> }
>
> Relatedly, at the data structure level, there is little semantic
> distinction between these two cases
>
> - Required / Non-nullable arrays
> - Optional arrays with null count 0
>
> My thoughts are that "required-ness" would best be minded at the
> metadata / schema level, rather than tasking the lowest tier of data
> structures and algorithms with handling the two semantically distinct,
> but functionally equivalent forms of data without nulls. When
> performing analytics, it adds complexity as some operations may
> introduce or remove nulls, which would require type metadata to be
> massaged, i.e.:
>
> function(required input) -> optional output, versus
>
> function(input [null_count == 0]) -> output [maybe null_count > 0].
>
> In the latter case, algorithms set bits and track the number of nulls
> while constructing the output Arrow array; the former adds some extra
> complexity.
>
> The question, of course, is where to enforce "required" in data
> interchange. If two systems have agreed (through exchange of
> schemas/metadata) that a particular batch of Arrow data is
> non-nullable, I would suggest that the null_count == 0 contract be
> validated at that point.
>
> Curious to hear others' thoughts on this, and please let me know if I
> can clarify anything I've said here.
>
> best,
> Wes
>
