(What follows mostly uses terminology and examples from the draft spec <https://github.com/apache/arrow/blob/master/format/Layout.md> and the C++ implementation <https://github.com/apache/arrow/tree/master/cpp>.)
Under the current spec, if I understand it correctly, there are two versions of every type: a nullable version and a non-nullable version. Whether a type T is nullable is specified by an attribute on that type, so declarations look something like T [nullable]:

  Int32 [non-nullable]
  Int32 [nullable]

In contrast, a variable-length version of type T is declared using a nested or parametric type, List<T>:

  List [non-nullable] <Int32 [non-nullable]>
  List [nullable] <Int32 [non-nullable]>
  List [non-nullable] <Int32 [nullable]>
  List [nullable] <Int32 [nullable]>

Consistent with this, might it make sense to use a nested type, *Nullable<T>*, to declare a nullable version of T? A slot in an array of type Nullable<T> would be able to contain either a null value or any value of type T. Using this system, the above examples would look like:

  Int32
  Nullable<Int32>
  List<Int32>
  Nullable<List<Int32>>
  List<Nullable<Int32>>
  Nullable<List<Nullable<Int32>>>

This could also be paralleled by a tweak to implementations (which wouldn't change how the data is stored in physical memory). Right now, in the C++ implementation, any array might be nullable, so the base Array class contains the logic necessary to deal with nullable arrays, including some potentially unnecessary branching. Instead, much like how an array of type List<T> is instantiated as a ListArray <https://github.com/apache/arrow/blob/23c4b08d154f8079806a1f0258d7e4af17bdf5fd/cpp/src/arrow/types/list.h> that contains its own int32 array of offsets and a reference to a child "values" array of type T <https://github.com/apache/arrow/blob/master/format/Layout.md>, every array of type Nullable<T> could be instantiated as a NullableArray containing its own bitmask and a reference to a child array of type T. (It could also keep track of the null_count, if that idea is implemented.)
The child array would not have to know that it is wrapped by a NullableArray, and we could abstract almost all of the logic for dealing with null values out of the Array and ArrayBuilder classes. (ArrayBuilder would need a method for incrementing an array's length without adding a particular value, but it already needs that to build a child array in a sparse union <https://github.com/apache/arrow/blob/master/format/diagrams/layout-sparse-union.png>.)

I see a few potential additional advantages to this approach:

- It would effectively halve the number of distinct types.
- As mentioned above, it would be consistent with how the Arrow spec treats List types and other nested types. (Treating "nullability" as a special attribute on each type seems more consistent with how something like Dremel treats both "optional" and "repeated" fields.)
- It would also be consistent with how "option types" <https://en.wikipedia.org/wiki/Option_type> are conventionally declared in other languages. For example, C# has Nullable<T> <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Julia has Nullable{T} <http://docs.julialang.org/en/release-0.4/manual/types/#nullable-types-representing-missing-values>, Java has Optional<T> <https://docs.oracle.com/javase/8/docs/api/java/util/Optional.html>, and Haskell has Maybe a <https://hackage.haskell.org/package/base-4.8.2.0/docs/Data-Maybe.html>.
- The implementation would be similar to a special case of a sparse union, which makes sense, since type Nullable<T> is equivalent to T ∪ NullType, where NullType is a type that can only hold the null value.
- If the spec ever adds a Masking<T> type that allows a selection of a vector using a bitmask, I imagine it, too, would be implemented similarly.
- Parsing a type declaration as a tree would be simpler.
- Declarations of non-nullable types would be shorter, and non-nullable types would be the default.
Making non-nullable types the default makes sense, I think, since it means types such as Int32 correspond exactly to their C types, and people would be less likely to be surprised by unexpected null values at runtime. Additionally, given the overhead (in speed, memory, complexity, and safety) of nullable types, I think programmers should have to declare them explicitly.

I'd appreciate any thoughts, especially from those with more involvement in the project or experience with related projects.

One more thought (which could apply even if nullability is kept as a special attribute): to make the string representation of nullable types more readable, it might be worth considering a convention from C# <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Swift <http://lithium3141.com/blog/2014/06/19/learning-swift-optional-types/>, Dremel <http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf>, and Flow <http://flowtype.org/docs/nullable-types.html>: using either T? or ?T to mean Nullable<T>. With the prefix convention, the earlier examples become:

  Int32
  ?Int32
  List<Int32>
  ?List<Int32>
  List<?Int32>
  ?List<?Int32>

And with the suffix convention:

  Int32
  Int32?
  List<Int32>
  List?<Int32>
  List<Int32?>
  List?<Int32?>