On Sat, Nov 10, 2018 at 4:42 PM Paul Taylor <ptay...@apache.org> wrote: > > While I'm not qualified to debate the merits of various physical > representations inside databases, I would like to chime in from the > perspective of both an Arrow contributor and architect of perhaps one of > the more exotic applications of Arrow in the wild (client/server + > JavaScript + GPUs + graphs). > > As Wes mentioned, the Arrow design needs to accommodate a wide range of > use-cases across analytic workflows. A premium has been placed on > simplicity and consistency, as design decisions related to on-the-wire > representation have an outsize impact on the complexity and architecture > of the tools and products we all build with it. To rip off Einstein, > Arrow should be as simple as it can be, but no simpler. > > I can't speak for others here, but the questions I ask when evaluating a > physical format are things like: > > 1. What impact do these decisions have on streaming architectures, both > on-device and on-the-wire? > 2. How difficult is it to write optimized routines on massively > parallel (either on or off-chip) architectures? > 3. Does this add complexity to (already complicated) byte alignment > requirements of non-CPU devices? > 4. Is this straightforward to implement in multiple languages, and can > I support the maintenance burden necessary to fix issues as they arise? > > And specifically for handling nulls, the questions I'd pose are: > > 1. How should Arrow represent nulls in nested Struct, Map, or Union > types, which don't allocate an underlying data buffer of their own? > 2. If variable-width types encode their nulls in the offsets table, how > does that impact thread divergence? > 3. How should nulls be encoded in fixed-size binary columns? > > A validity bitmap is a consistent approach to handle nulls across the > primitive, variable width, and nested types. 
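To make question 1 above concrete: a Struct array carries its own validity bitmap even though it allocates no data buffer, so its nulls are tracked independently of its children's. A pure-Python sketch of that layout (the helper and variable names are mine, not Arrow APIs):

```python
def bit_is_set(bitmap: bytes, i: int) -> bool:
    # Arrow validity bitmaps are LSB-numbered: bit i lives in byte i // 8.
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1

# A struct<x: int32, y: int32> array allocates no data buffer of its own:
# just a validity bitmap plus its child arrays.
struct_validity = bytes([0b00000101])  # rows 0 and 2 valid, row 1 null
child_x = [1, 0, 3]                    # bytes at null rows are unspecified
child_y = [4, 0, 6]

rows = [{"x": child_x[i], "y": child_y[i]}
        if bit_is_set(struct_validity, i) else None
        for i in range(3)]
print(rows)  # [{'x': 1, 'y': 4}, None, {'x': 3, 'y': 6}]
```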
It's been my experience in > the past that other representations may be simpler locally, but are > either more complex or outright preclude their use when the rest of the > pipeline is taken into account. > > Additionally, the validity bitmap doesn't preclude you from using > sentinels informally in your own systems. The Arrow spec leaves the data > bytes for null positions unspecified, which you could certainly fill in > with INT_MIN for any tasks where that's most efficient or convenient, > and is very fast with a trivial bit of OpenCL or CUDA (this is a common > technique working with GPUs, as it's often faster to scan over a > streaming dataset up front vs. incur the cost of branching or thread > divergence later). One might argue this is abusing the format, but from > the perspective of shipping high-quality features quickly, I've found > Arrow's flexibility in this area and others a net plus.
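A pure-Python stand-in for the up-front OpenCL/CUDA sentinel scan Paul describes, assuming INT32_MIN as the sentinel and Arrow's LSB-numbered validity bitmaps; the helper names here are illustrative, not Arrow APIs:

```python
INT32_MIN = -(2 ** 31)

def bit_is_set(bitmap: bytes, i: int) -> bool:
    # Arrow validity bitmaps are LSB-numbered: bit i lives in byte i // 8.
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1

def overlay_sentinel(values: list, bitmap: bytes, sentinel=INT32_MIN) -> list:
    # One up-front pass writes the sentinel into every null slot (legal,
    # since Arrow leaves data bytes at null positions unspecified), so
    # later kernels can compare against the sentinel without branching
    # on the bitmap.
    return [v if bit_is_set(bitmap, i) else sentinel
            for i, v in enumerate(values)]

# bitmap 0b00001011 -> slots 0, 1, 3 valid; slot 2 null
print(overlay_sentinel([7, 8, 999, 10], bytes([0b00001011])))
# [7, 8, -2147483648, 10]
```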
To follow up on this point, one way to address this "informal" use of sentinels is to add metadata to schemas [1] so that you can distinguish endogenous datasets from ones that came from some other Arrow producer. We're pretty keen on encouraging portable algorithm development (so a routine can run on Arrow memory from any producer), so something like this would result in non-portable algorithms being written, but it's up to developers to do what best suits their needs. [1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L265 > > Best, > Paul > > On 11/9/18 3:53 PM, Wes McKinney wrote: > > hi Matt, > > On Fri, Nov 9, 2018 at 6:36 PM Matt Dowle <mattjdo...@gmail.com> wrote: > >> On Fri, Nov 9, 2018 at 2:14 PM Wes McKinney <wesmck...@gmail.com> wrote: > >> > >>> On Fri, Nov 9, 2018 at 4:51 PM Matt Dowle <mattjdo...@gmail.com> wrote: > >>>>> There is one database that I'm aware of that uses sentinels _and_ > >>>> supports complex types with missing values: Kx's KDB+. > >>>> I read this and was pleased that KDB is being used as a reference. It > >>> is a > >>>> seriously good database: the gold-standard in many people's eyes. > >>>> > >>>>> This has led to some seriously strange choices like the ASCII space > >>>> character being used as the sentinel value for strings. > >>>> But then I saw this. Surely if sentinels are good enough for KDB then > >>> isn't > >>>> that a sign that sentinels are not as bad as this group fears? > >>> KDB has a good reputation in the financial world, but it is a very > >>> niche product. I personally wouldn't draw any inferences about > >>> database design from something with such a small and specialized > >>> audience. > >>> > >> I find this view hard to understand. Why not draw some inference from a > >> highly respected product. > > To make an analogy, KDB is like a Formula 1 car. 
Formula 1 cars are > > built to drive in a very particular way in a particular environment, > > and I don't think it's representative of driving or car design in > > general. KDB cannot be used interchangeably where PostgreSQL is used, > > for example. > > > >> > >>>> What about grouping and joining columns that contain NA? Here's an > >>>> example from R data.table : > >>>> > >>>>> DT = data.table(x=c(1,3,3,NA,1,NA), v=1:6) > >>>>> DT > >>>> x v > >>>> <num> <int> > >>>> 1: 1 1 > >>>> 2: 3 2 > >>>> 3: 3 3 > >>>> 4: NA 4 > >>>> 5: 1 5 > >>>> 6: NA 6 > >>>>> DT[,sum(v),keyby=x] > >>>> x V1 > >>>> <num> <int> > >>>> 1: NA 10 > >>>> 2: 1 6 > >>>> 3: 3 5 > >>>> > >>>> The NAs are grouped as a distinct value and are not excluded for > >>>> statistical robustness reasons. This is very easy to achieve efficiently > >>>> internally; in fact there is no special code to deal with the NA values > >>>> because they are just another distinct value (the sentinel). In Arrow > >>> if a > >>>> bitmap is present, there would be more code needed to deal with the NAs > >>>> (either way: including the NA group or excluding the NA group), if I > >>>> understand correctly. > >>> It depends on who's doing the analysis. Some database systems exclude > >>> nulls in aggregations altogether. In others you indeed would need to > >>> reserve an aggregation bucket to null and use the bitmap when > >>> determining which bucket to update for each value > >>> > >> Then would you agree this is a downside of the bitmap approach? Note that > >> whether or not you are including or excluding the NA group, there is still > >> more code to write with the bitmap approach. Under the sentinel approach > >> not only is there nothing special to write, but the NA group automatically > >> comes first because it sorts first (INT_MIN). When returning groups in > >> first-appearance order, there is again no special code for NAs because the > >> sentinel is just another value. 
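The "reserve an aggregation bucket for null" approach Wes describes can be sketched in pure Python over the same data as the data.table example above (the helper names are mine, and `None` stands in for the null bucket):

```python
def bit_is_set(bitmap: bytes, i: int) -> bool:
    # Arrow validity bitmaps are LSB-numbered: bit i lives in byte i // 8.
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1

def group_sum(keys, values, key_bitmap):
    # One extra bucket (keyed by None) collects the null group; the
    # bitmap decides which bucket each row updates.
    sums = {}
    for i, v in enumerate(values):
        k = keys[i] if bit_is_set(key_bitmap, i) else None
        sums[k] = sums.get(k, 0) + v
    return sums

# Same data as the data.table example: x = 1, 3, 3, NA, 1, NA; v = 1..6.
# bitmap 0b00010111 -> rows 3 and 5 (0-based) are null.
print(group_sum([1, 3, 3, 0, 1, 0], [1, 2, 3, 4, 5, 6],
                bytes([0b00010111])))
# {1: 6, 3: 5, None: 10}
```

This reproduces the data.table result (NA: 10, 1: 6, 3: 5); dropping the null group instead is the one-line change of skipping rows whose bit is clear.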
Binary search in a column containing NA > >> is again very convenient (no special code for NA) since INT_MIN comes > >> first. > >> > >> In case it wasn't clear: I would really like to use Arrow. > >> > > Database systems in general find sentinel values to be unacceptable > > from a data representation point of view, as they need to be able to > > distinguish null from any single value. We need Arrow to be a > > technology that can be adopted widely by database systems, data > > science tools, and other systems that process data. I think that if we > > have to expend some extra effort in some cases in the interest of > > creating an inclusive, widely applicable and adopted technology, that > > is a worthwhile compromise. > > > >>>> On Thu, Nov 8, 2018 at 3:18 PM Phillip Cloud <cpcl...@gmail.com> wrote: > >>>> > >>>>> There is one database that I'm aware of that uses sentinels _and_ > >>> supports > >>>>> complex types with missing values: Kx's KDB+. This has led to some > >>>>> seriously strange choices like the ASCII space character being used as > >>> the > >>>>> sentinel value for strings. See > >>>>> https://code.kx.com/wiki/Reference/Datatypes for > >>>>> more details. > >>>>> > >>>>> On Thu, Nov 8, 2018 at 4:39 PM Wes McKinney <wesmck...@gmail.com> > >>> wrote: > >>>>>> hey Matt, > >>>>>> > >>>>>> Thanks for giving your perspective on the mailing list. > >>>>>> > >>>>>> My objective in writing about this recently > >>>>>> (http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/, though I > >>>>>> need to update since the sentinel case can be done more efficiently > >>>>>> than what's there now) was to help dispel the notion that using a > >>>>>> separate value (bit or byte) to encode nullness is a performance > >>>>>> compromise to comply with the requirements of database systems. I too > >>>>>> prefer real world benchmarks to microbenchmarks, and probably null > >>>>>> checking is not going to be the main driver of aggregate system > >>>>>> performance. 
I had heard many people over the years object to bitmaps > >>>>>> on performance grounds but without analysis to back it up. > >>>>>> > >>>>>> Some context for other readers on the mailing list: A language like R > >>>>>> is not a database and has fewer built-in scalar types: int32, double, > >>>>>> string (interned), and boolean. Out of these, int32 and double can > >>> use > >>>>>> one bit pattern for NA (null) and not lose too much. A database > >>> system > >>>>>> generally can't make that kind of compromise, and most popular > >>>>>> databases can distinguish INT32_MIN (or any other value used as a > >>>>>> sentinel) and null. If you loaded data from an Avro or Parquet file > >>>>>> that contained one of those values, you'd have to decide what to do > >>>>>> with the data (though I understand there's integer64 add-on packages > >>>>>> for R now) > >>>>>> > >>>>>> Now back to Arrow -- we have 3 main kinds of data types: > >>>>>> > >>>>>> * Fixed size primitive > >>>>>> * Variable size primitive (binary, utf8) > >>>>>> * Nested (list, struct, union) > >>>>>> > >>>>>> Out of these, "fixed size primitive" is the only one that can > >>>>>> generally support O(1) in-place mutation / updates, though all of > >>> them > >>>>>> could support a O(1) "make null" operation (by zeroing a bit). In > >>>>>> general, when faced with designs we have preferred choices benefiting > >>>>>> use cases where datasets are treated as immutable or copy-on-write. > >>>>>> > >>>>>> If an application _does_ need to do mutation on primitive arrays, > >>> then > >>>>>> you could choose to always allocate the validity bitmap so that it > >>> can > >>>>>> be mutated without requiring allocations to happen arbitrarily in > >>> your > >>>>>> processing workflow. But, if you have data without nulls, it is a > >>> nice > >>>>>> feature to be able to ignore the bitmap or not allocate one at all. 
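The O(1) "make null" operation mentioned above amounts to clearing one bit; a pure-Python sketch (the function name is mine, not an Arrow API):

```python
def make_null(bitmap: bytearray, i: int) -> None:
    # O(1) "make null": clear validity bit i in place. The data bytes can
    # stay untouched, since Arrow leaves bytes at null slots unspecified.
    bitmap[i // 8] &= ~(1 << (i % 8)) & 0xFF

validity = bytearray([0b00001111])  # four valid slots
make_null(validity, 2)
print(f"{validity[0]:08b}")  # 00001011 -> slot 2 is now null
```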
> >>> If > >>>>>> you constructed an array from data that you know to be non-nullable, > >>>>>> some implementations might wish to avoid the waste of creating a > >>>>>> bitmap with all 1's. > >>>>>> > >>>>>> For example, if we create an arrow::Array from a normal NumPy array > >>> of > >>>>>> integers (which cannot have nulls), we have > >>>>>> > >>>>>> In [6]: import pyarrow as pa > >>>>>> In [7]: import numpy as np > >>>>>> In [8]: arr = pa.array(np.array([1, 2, 3, 4])) > >>>>>> > >>>>>> In [9]: arr.buffers() > >>>>>> Out[9]: [None, <pyarrow.lib.Buffer at 0x7f34ecd3eea0>] > >>>>>> > >>>>>> In [10]: arr.null_count > >>>>>> Out[10]: 0 > >>>>>> > >>>>>> Normally, the first buffer would be the validity bitmap memory, but > >>>>>> here it was not allocated because there are no nulls. > >>>>>> > >>>>>> Creating an open standard data representation is a difficult thing; > >>>>>> one cannot be "all things to all people" but the intent is to be a > >>>>>> suitable lingua franca for language agnostic data interchange and as > >>> a > >>>>>> runtime representation for analytical query engines (where most > >>>>>> operators are "pure"). If the Arrow community's goal were to create a > >>>>>> "mutable column store" then some things might be designed differently > >>>>>> (perhaps more like internals of https://kudu.apache.org/). It is > >>>>>> helpful to have an understanding of what compromises have been made > >>>>>> and how costly they are in real world applications. > >>>>>> > >>>>>> best > >>>>>> Wes > >>>>>> On Mon, Nov 5, 2018 at 8:27 PM Jacques Nadeau <jacq...@apache.org> > >>>>> wrote: > >>>>>>> On Mon, Nov 5, 2018 at 3:43 PM Matt Dowle <mattjdo...@gmail.com> > >>>>> wrote: > >>>>>>>> 1. I see. Good idea. Can we assume bitmap is always present in > >>> Arrow > >>>>>> then? > >>>>>>>> I thought I'd seen Wes argue that if there were no NAs, the > >>> bitmap > >>>>>> doesn't > >>>>>>>> need to be allocated.
Indeed I wasn't worried about the extra > >>>>> storage, > >>>>>>>> although for 10,000 columns I wonder about the number of vectors. > >>>>>>>> > >>>>>>> I think different implementations handle this differently at the > >>>>> moment. > >>>>>> In > >>>>>>> the Java code, we allocate the validity buffer at initial > >>> allocation > >>>>>>> always. We're also looking to enhance the allocation strategy so > >>> the > >>>>>> fixed > >>>>>>> part of values are always allocated with validity (single > >>> allocation) > >>>>> to > >>>>>>> avoid any extra object housekeeping. > >>>>>>> > >>>>>>> > >>>>>>>> 2. It's only subjective until the code complexity is measured, > >>> then > >>>>>> it's > >>>>>>>> not subjective. I suppose after 20 years of using sentinels, I'm > >>> used > >>>>>> to it > >>>>>>>> and trust it. I'll keep an open mind on this. > >>>>>>>> > >>>>>>> Yup, fair enough. > >>>>>>> > >>>>>>> > >>>>>>>> 3. Since I criticized the scale of Wes' benchmark, I felt I > >>> should > >>>>>> show how > >>>>>>>> I do benchmarks myself to show where I'm coming from. Yes > >>> no-null, > >>>>>>>> some-null and all-null paths offer savings. But that's the same > >>> under > >>>>>> both > >>>>>>>> sentinel and bitmap approaches. Under both approaches, you just > >>> need > >>>>> to > >>>>>>>> know which case you're in. That involves storing the number of > >>> NAs in > >>>>>> the > >>>>>>>> header/summary which can be done under both approaches. > >>>>>>>> > >>>>>>> The item we appreciate is that you can do a single comparison > >>> every 64 > >>>>>>> values to determine which of the three cases you are in (make this > >>> a > >>>>>> local > >>>>>>> decision). This means you don't have to do housekeeping ahead of > >>> time. > >>>>> It > >>>>>>> also means that the window of choice is narrow, minimizing the > >>> penalty > >>>>> in > >>>>>>> situations where you have rare invalid values (or rare valid > >>> values).
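The single-comparison-per-64-values decision Jacques describes can be sketched in pure Python: read the validity bitmap one 64-bit word at a time and compare it against all-ones or zero (function names are mine; a real kernel would read the bitmap directly as machine words):

```python
ALL_VALID = (1 << 64) - 1

def classify_window(bitmap: bytes, word_index: int) -> str:
    # One comparison per 64 values picks the fast path locally,
    # with no up-front null-count bookkeeping.
    word = int.from_bytes(bitmap[word_index * 8:(word_index + 1) * 8],
                          "little")
    if word == ALL_VALID:
        return "all-valid"   # skip null checks for these 64 values
    if word == 0:
        return "all-null"    # skip these 64 values entirely
    return "mixed"           # per-bit path only for this window

print(classify_window(b"\xff" * 8, 0))            # all-valid
print(classify_window(b"\x00" * 8, 0))            # all-null
print(classify_window(b"\x0f" + b"\x00" * 7, 0))  # mixed
```

This also shows why rare invalid (or rare valid) values carry little penalty: only the handful of mixed windows pays for per-bit checks.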