Inline responses On Thu, Feb 25, 2016 at 4:23 PM, Daniel Robinson <danrobinson...@gmail.com> wrote: > Hi Wes, > > Thanks for the response! > > I see the appeal of representing all arrays at the data structure level as > "nullable," but I think I've convinced myself that it's better to represent > them all as non-nullable, and bolt on "nullability" as a wrapper type. In > addition to the reasons I mentioned yesterday (such as localizing logic for > dealing with null values in its own class), I think there's a pretty strong > efficiency case for it. If you don't distinguish between nullable and > non-nullable data structures at the data structure level, you're > essentially hiding useful optimization information from the compiler. >
I question whether this optimization information is useful given many of the target workloads (e.g. scan-based analytics). On one code branch (if null_count > 0), a bitmap is examined either 1-bit at a time or using popcount (1 word's worth of values at a time -- this could improve CPU instruction pipelining), in the other branch the bitmap is ignored. > For example, arrays should probably have a simple API for getting a value > from a specified slot number, something like get(int32_t offset). This > makes it easier to write algorithms that don't have to worry about the > underlying memory structure of the array. If all arrays at the data > structure level are considered "nullable", this code would have to do an > "if (null_count == 0)" check every time get() is called, which is > inefficient and somewhat inelegant. > I have been expecting that any user of an Arrow C/C++ library would be expected to understand the physical memory layout; with functions to access values made available as a convenience. So I would not want you to call get(i) unless you know that it is not null and not out of bounds. > I made a demo showing how I'd envision implementing nullable arrays in C++ > with nested types, and how algorithms on them would work: > https://github.com/danrobinson/arrow-demo > > Right now it's just nullable and primitive types and arrays, but I'll try > adding list types and builders to see how it compares to the > string-handling code you linked to. > > One key advantage I see of making Nullable<T> its own nested type is that > the algorithm code can make full use of C++'s template specialization. For > example, here's how I define the versions of the Mean algorithm that work > on nullable arrays and those that work on non-nullable arrays (see > https://github.com/danrobinson/arrow-demo/blob/master/algorithms.h): > > // happens when you call Mean() on a non-nullable array > template<typename T> > double Mean(Array<T>& arr) { > return (double)Sum(arr) / (arr.length()); > } > > // happens when you call Mean() on a nullable array > template<typename T> > double Mean(Array< Nullable<T> >& nullableArr) { > return (double)Sum(nullableArr) / (nullableArr.length() - > nullableArr.null_count()); > } > > This is only possible if nullability is included in the > data-structure-level type information. > > As you suggested, the code also uses a shortcut when Sum() is called on a > "nullable" array that has null_count of 0: it just calls Sum() on its child > array (see line 24). I think this, too, is simpler if we aren't cramming > nullability logic into every array class. > > There are a few other tricks that could be used whether or not you treat > Nullable as a separate type. For example, we can implement an overloaded > get() function that takes, in addition to a slot offset, a default value > that should be returned if that slot is null (see > https://github.com/danrobinson/arrow-demo/blob/master/arrays.h#L114). This > let me write the main code for the Sum() algorithm only once, since for its > purposes, 0 is equivalent to null. (It would also presumably work with > Product() and more importantly sorting). > Cool, thanks for making a prototype! Working code is worth 1000 words =) I'm not convinced by the performance argument. For example, this snippet has multiple if-then-else branches (which could possibly throw exceptions) and so would not be advisable to put in the inner loop of algorithms: value_type get(int32_t slotNumber, value_type null_value) { check_bounds(slotNumber); return safe_get(slotNumber, null_value); } "in time critical code, throwing an exception should be the exception, not the rule." http://www.boost.org/community/error_handling.html > Again, would appreciate any thoughts. > > I definitely agree that the interface for users should be the priority > (after the memory format). My own motivation for thinking about this is my > interest in making easy APIs for creating or accessing these memory > structures in higher-level dynamically typed languages. > I agree that making Arrow data structures accessible and easy-to-use in dynamic languages is desirable. I don't think that adding such overhead to reference C++ algorithms (which should probably be as close-to-the-metal and performance-oriented at possible, and expect users to use them responsibly) is a good idea, since usability / accessibility generally comes at the expense of CPU cycles and cache efficiency. For example, it would be straightforward to build "airbags", so to speak, in a Python wrapper C extension, since performance is less critical in user APIs oriented at accessibility and safety / error-handling. In dynamic languages this is especially important as bugs in e.g. Python user code should never result in segfaults/core dumps. cheers, Wes > On Wed, Feb 24, 2016 at 6:36 PM, Wes McKinney <w...@cloudera.com> wrote: > >> hi Dan, >> >> Thanks for these thoughts! There's a few separate problems >> >> - Physical memory representation >> - Metadata >> - Implementation container type layering / user API >> >> We aren't expecting any system that uses Arrow to necessarily use one >> of the reference Arrow implementations, but I expect an ecosystem of >> tools will develop around the reference implementations (e.g. Parquet >> or Avro adapters). >> >> In the thread I started last week, I proposed striking this >> distinction and instead relegating REQUIRED (non-nullable) and >> OPTIONAL (nullable) to the domain of metadata. In my opinion, having >> nullability in the container types at all adds much code complexity, >> and the two cases of >> >> - null count == 0 >> - non-nullable >> >> Must be treated as distinct, but in general algorithmically equivalent >> (e.g. the bitmap need not be allocated, and would be ignored even if >> so in most analytics code requiring knowledge of null-ness). >> >> In trying to come up with functional type systems for representing the >> semantics of the data, I find it is most useful to start from the >> desired user API (which sufficiently captures the nuances of the >> physical memory layout) and work backwards to an implementation. >> >> I would be interested to see some rough C++ / pseudocode to help >> illustrate how this would play out in downstream user code using >> libarrow and its headers, and internal algorithms in that operate on >> arrow Array instances. I've seen many functional programmers work on >> these data structure + type system problems and end up chasing their >> tails for a very long time. On the C++ side, that code you see in >> apache/arrow is a limited prototype that I haven't used to build any >> applications yet, so it'd be great to iterate on some working >> prototypes of other layerings. For example, we could work on: >> >> >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/types/string-test.cc#L199 >> >> As pseudocode, suppose we had a function: >> >> double Sum(DoubleArray* arr) { >> double total = 0; >> double* values = arr->values(); >> if (arr->null_count() == 0) { >> for (int i = 0; i < arr->length(); ++i) { >> total += values[i]; >> } >> } else { >> for (int i = 0; i < arr->length(); ++i) { >> if (arr->NotNull(i)) { >> total += values[i]; >> } >> } >> } >> return total; >> } >> >> If you had instead NullableDoubleArray and DoubleArray, you might want >> to generate different functions. This would likely be true for any >> algorithm where nulls are semantically meaningful. >> >> For me at least, the most important aspect of Arrow is the memory >> layout. The purpose of the implementations is to provide a reference >> user API to make it as convenient and fast as possible to create >> Arrow-layout data, manipulate data in-memory, and send/receive data >> via IPC/shared-memory. >> >> - Wes >> >> On Wed, Feb 24, 2016 at 2:18 PM, Daniel Robinson >> <danrobinson...@gmail.com> wrote: >> > (The below mostly uses terminology and examples from the draft spec >> > <https://github.com/apache/arrow/blob/master/format/Layout.md> and C++ >> > implementation <https://github.com/apache/arrow/tree/master/cpp>). >> > >> > Under the current spec, if I understand it correctly, there are two >> > versions of every type: a nullable version and a non-nullable version. >> > Whether a type T is nullable is specified by an attribute on that type, >> so >> > declarations look something like T [nullable]: >> > >> > Int32 [non-nullable] >> > >> > Int32 [nullable] >> > >> > >> > In contrast, a variable-length version of type T is declared using a >> nested >> > or parametric type, List<T>. >> > >> > >> > List [non-nullable] <Int32 [non-nullable]> >> > List [nullable] <Int32 [non-nullable]> >> > List [non-nullable] <Int32 [nullable]> >> > List [nullable] <Int32 [nullable] > >> > >> > Consistent with this, might it make sense to use a nested type, >> > *Nullable<T>*, to declare a nullable version of that type? A slot in an >> > array of type Nullable<T> would be able to contain either a null value or >> > any value of type T. Using this system, the above examples would look >> like: >> > >> > Int32 >> > Nullable<Int32> >> > List<Int32> >> > >> > Nullable<List<Int32>> >> > List<Nullable<Int32>> >> > Nullable<List<Nullable<Int32>>>> >> > >> > >> > This could also be paralleled by a tweak to implementations (which >> wouldn't >> > change how the data is stored in physical memory). Right now, in the C++ >> > implementation, any array might be nullable, so the base Array class >> > contains the logic necessary to deal with nullable arrays, including some >> > potentially unnecessary branching. >> > >> > Instead, much like how an array of type List<T> is instantiated as a >> > ListArray >> > < >> https://github.com/apache/arrow/blob/23c4b08d154f8079806a1f0258d7e4af17bdf5fd/cpp/src/arrow/types/list.h >> > >> > that contains its own int32 array of offsets and a reference to a child >> > "values" array of type T >> > <https://github.com/apache/arrow/blob/master/format/Layout.md>, every >> array >> > of type Nullable<T> could be instantiated as a NullableArray containing >> its >> > own bitmask and a reference to a child array of type T. (It could also >> keep >> > track of the null_count, if that idea is implemented.) >> > >> > The child array would not have to know that it is wrapped by a >> > NullableArray, and we could abstract almost all of the logic for dealing >> > with null values out of the Array and ArrayBuilder classes. (ArrayBuilder >> > would need a method for incrementing an array's length without adding a >> > particular value, but it already needs that to build a child array in a >> Sparse >> > Union >> > < >> https://github.com/apache/arrow/blob/master/format/diagrams/layout-sparse-union.png >> > >> > ). >> > >> > I see a few potential additional advantages to this approach: >> > >> > - It would effectively halve the number of distinct types. >> > - As mentioned above, it would be consistent with how the Arrow spec >> > treats List types and other nested types. Treating "nullability" as a >> > special attribute on each type seems more consistent with how >> something >> > like Dremel treats both "optional" and "repeated" fields. >> > - It would also be consistent with how "option types >> > <https://en.wikipedia.org/wiki/Option_type>" are conventionally >> declared >> > in other languages. For example, C# has Nullable<T> >> > <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Julia has >> > Nullable{T} >> > < >> http://docs.julialang.org/en/release-0.4/manual/types/#nullable-types-representing-missing-values >> >, >> > Java >> > has Optional<T> >> > <https://docs.oracle.com/javase/8/docs/api/java/util/Optional.html>, >> > and Haskell >> > has Maybe a >> > < >> https://hackage.haskell.org/package/base-4.8.2.0/docs/Data-Maybe.html>. >> > - The implementation would be similar to a special case of >> SparseUnion, >> > which makes sense, since type Nullable<T> is equivalent to T ∪ >> NullType, >> > where NullType is a type that can only hold the null value. >> > - If the spec ever adds a Masking<T> type that allows a selection of a >> > vector using a bitmask, I imagine it, too, would be implemented >> similarly. >> > - Parsing a type declaration as a tree would be simpler. >> > - Declarations of non-nullable types would be shorter, and >> non-nullable >> > types would be the default. I think this makes sense, since it means >> types >> > such as Int32 correspond exactly to their C-type, and people would be >> less >> > likely to be surprised by unexpected null values at runtime. >> Additionally, >> > given the overhead (in speed, memory, complexity, and safety) of >> nullable >> > types, I think programmers should have to explicitly declare them. >> > >> > I'd appreciate any thoughts, especially from those with more involvement >> in >> > the project or experience with related projects. >> > >> > One more thought (which could apply even if nullability is kept as a >> > special attribute). To make the string representation of nullable types >> > more readable, it might be worth considering a convention from C# >> > <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Swift >> > <http://lithium3141.com/blog/2014/06/19/learning-swift-optional-types/>, >> > Dremel >> > < >> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf >> >, >> > and Flow <http://flowtype.org/docs/nullable-types.html>: using either >> T? or >> > ?T to mean Nullable<T>. >> > >> > Int32 >> > ?Int32 >> > List<Int32> >> > ?List<Int32> >> > List<?Int32> >> > ?List<?Int32> >> > >> > Int32 >> > Int32? >> > List<Int32> >> > List?<Int32> >> > List<Int32?> >> > List?<Int32?> >> >> My inclination is still to remove nullable from the C++ array >> container types, but to keep track of the required (non-nullable) / >> optional (nullable) bit in the type metadata so that it can be kept >> track of in IPC / schema exchange with other systems as necessary. >>