(The below mostly uses terminology and examples from the draft spec
<https://github.com/apache/arrow/blob/master/format/Layout.md> and C++
implementation <https://github.com/apache/arrow/tree/master/cpp>).

Under the current spec, if I understand it correctly, there are two
versions of every type: a nullable version and a non-nullable version.
Whether a type T is nullable is specified by an attribute on that type, so
declarations look something like T [nullable]:

Int32 [non-nullable]

Int32 [nullable]


In contrast, a variable-length version of type T is declared using a nested
or parametric type, List<T>.


List [non-nullable] <Int32 [non-nullable]>
List [nullable] <Int32 [non-nullable]>
List [non-nullable] <Int32 [nullable]>
List [nullable] <Int32 [nullable]>

Consistent with this, might it make sense to use a nested type,
*Nullable<T>*, to declare a nullable version of that type? A slot in an
array of type Nullable<T> would be able to contain either a null value or
any value of type T. Using this system, the above examples would look like:

Int32
Nullable<Int32>

List<Int32>
Nullable<List<Int32>>
List<Nullable<Int32>>
Nullable<List<Nullable<Int32>>>


This could also be paralleled by a tweak to implementations (which wouldn't
change how the data is stored in physical memory). Right now, in the C++
implementation, any array might be nullable, so the base Array class
contains the logic necessary to deal with nullable arrays, including some
potentially unnecessary branching.

Instead, much like how an array of type List<T> is instantiated as a
ListArray
<https://github.com/apache/arrow/blob/23c4b08d154f8079806a1f0258d7e4af17bdf5fd/cpp/src/arrow/types/list.h>
 that contains its own int32 array of offsets and a reference to a child
"values" array of type T
<https://github.com/apache/arrow/blob/master/format/Layout.md>, every array
of type Nullable<T> could be instantiated as a NullableArray containing its
own bitmask and a reference to a child array of type T. (It could also keep
track of the null_count, if that idea is implemented.)
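To make the idea concrete, here is a rough sketch of what such a wrapper
might look like (all names hypothetical; a std::vector<bool> stands in for
the packed validity bitmask the spec actually describes):

```cpp
#include <algorithm>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical non-nullable child array: it carries no null-handling
// logic at all.
class Int32Array {
 public:
  explicit Int32Array(std::vector<int32_t> values)
      : values_(std::move(values)) {}
  int64_t length() const { return static_cast<int64_t>(values_.size()); }
  int32_t Value(int64_t i) const { return values_[static_cast<size_t>(i)]; }

 private:
  std::vector<int32_t> values_;
};

// Hypothetical NullableArray<T>: owns the validity bitmap (a bool vector
// here for clarity, rather than the spec's packed bitmask) plus a reference
// to a child array of type T. The child is unaware of the wrapper.
template <typename ChildArray>
class NullableArray {
 public:
  NullableArray(std::vector<bool> valid, std::shared_ptr<ChildArray> child)
      : valid_(std::move(valid)), child_(std::move(child)) {}
  bool IsNull(int64_t i) const { return !valid_[static_cast<size_t>(i)]; }
  // The null_count idea mentioned above, derived from the bitmap here.
  int64_t null_count() const {
    return std::count(valid_.begin(), valid_.end(), false);
  }
  const ChildArray& child() const { return *child_; }

 private:
  std::vector<bool> valid_;
  std::shared_ptr<ChildArray> child_;
};
```

The point of the sketch is that all branching on validity lives in
NullableArray; an Int32Array used directly never pays for it.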

The child array would not have to know that it is wrapped by a
NullableArray, and we could abstract almost all of the logic for dealing
with null values out of the Array and ArrayBuilder classes. (ArrayBuilder
would need a method for incrementing an array's length without adding a
particular value, but it already needs that to build a child array in a Sparse
Union
<https://github.com/apache/arrow/blob/master/format/diagrams/layout-sparse-union.png>
).
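The builder side might then look something like this (again hypothetical
names; AppendEmpty is the "increment length without a particular value"
method mentioned above):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical child builder for a non-nullable Int32 array.
class Int32Builder {
 public:
  void Append(int32_t v) { values_.push_back(v); }
  // Advances the length without a caller-supplied value; the slot's
  // contents are unspecified, as in the inactive children of a sparse union.
  void AppendEmpty() { values_.push_back(0); }
  int64_t length() const { return static_cast<int64_t>(values_.size()); }

 private:
  std::vector<int32_t> values_;
};

// Hypothetical NullableArrayBuilder: records validity itself and delegates
// everything else to the child builder, which never sees a null.
template <typename ChildBuilder, typename Value>
class NullableArrayBuilder {
 public:
  void Append(Value v) {
    valid_.push_back(true);
    child_.Append(v);
  }
  void AppendNull() {
    valid_.push_back(false);
    child_.AppendEmpty();  // keep the child's length in sync
  }
  int64_t length() const { return child_.length(); }

 private:
  std::vector<bool> valid_;
  ChildBuilder child_;
};
```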

I see a few potential additional advantages to this approach:

   - It would effectively halve the number of distinct types.
   - As mentioned above, it would be consistent with how the Arrow spec
   treats List types and other nested types. (By contrast, treating
   "nullability" as a special attribute on each type seems more consistent
   with how a system like Dremel treats both "optional" and "repeated"
   fields.)
   - It would also be consistent with how "option types
   <https://en.wikipedia.org/wiki/Option_type>" are conventionally declared
   in other languages. For example, C# has Nullable<T>
   <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Julia has
   Nullable{T}
   <http://docs.julialang.org/en/release-0.4/manual/types/#nullable-types-representing-missing-values>,
   Java has Optional<T>
   <https://docs.oracle.com/javase/8/docs/api/java/util/Optional.html>,
   and Haskell has Maybe a
   <https://hackage.haskell.org/package/base-4.8.2.0/docs/Data-Maybe.html>.
   - The implementation would be similar to a special case of a sparse
   union, which makes sense, since type Nullable<T> is equivalent to T ∪
   NullType, where NullType is a type that can only hold the null value.
   - If the spec ever adds a Masking<T> type that allows a selection of a
   vector using a bitmask, I imagine it, too, would be implemented similarly.
   - Parsing a type declaration as a tree would be simpler.
   - Declarations of non-nullable types would be shorter, and non-nullable
   types would be the default. I think this makes sense, since it means types
   such as Int32 correspond exactly to their C-type, and people would be less
   likely to be surprised by unexpected null values at runtime. Additionally,
   given the overhead (in speed, memory, complexity, and safety) of nullable
   types, I think programmers should have to explicitly declare them.
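On the parsing point above: with nullability folded into the type tree, a
declaration can be read by a tiny recursive-descent parser with no special
attribute handling. A sketch (hypothetical names; it assumes declarations
contain no whitespace):

```cpp
#include <cctype>
#include <memory>
#include <string>
#include <vector>

// A type declaration is just a name plus optional type parameters, so the
// parse result is a plain tree.
struct TypeNode {
  std::string name;                               // e.g. "List", "Int32"
  std::vector<std::unique_ptr<TypeNode>> params;  // type parameters, if any
};

// Parses declarations like "Nullable<List<Nullable<Int32>>>".
std::unique_ptr<TypeNode> ParseType(const std::string& s, size_t& pos) {
  auto node = std::make_unique<TypeNode>();
  while (pos < s.size() &&
         std::isalnum(static_cast<unsigned char>(s[pos]))) {
    node->name += s[pos++];
  }
  if (pos < s.size() && s[pos] == '<') {
    ++pos;  // consume '<'
    node->params.push_back(ParseType(s, pos));
    while (pos < s.size() && s[pos] == ',') {
      ++pos;  // consume ','
      node->params.push_back(ParseType(s, pos));
    }
    ++pos;  // consume '>'
  }
  return node;
}
```

Nullable is then just another node in the tree, with no "[nullable]"
attribute bolted onto every grammar rule.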

I'd appreciate any thoughts, especially from those with more involvement in
the project or experience with related projects.

One more thought (which could apply even if nullability is kept as a
special attribute). To make the string representation of nullable types
more readable, it might be worth considering a convention from C#
<https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Swift
<http://lithium3141.com/blog/2014/06/19/learning-swift-optional-types/>,
Dremel
<http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf>,
and Flow <http://flowtype.org/docs/nullable-types.html>: using either T? or
?T to mean Nullable<T>.

Int32
?Int32
List<Int32>
?List<Int32>
List<?Int32>
?List<?Int32>

Int32
Int32?
List<Int32>
List?<Int32>
List<Int32?>
List?<Int32?>
