On Sat, Nov 10, 2018 at 4:42 PM Paul Taylor <ptay...@apache.org> wrote:
>
> While I'm not qualified to debate the merits of various physical
> representations inside databases, I would like to chime in from the
> perspective of both an Arrow contributor and architect of perhaps one of
> the more exotic applications of Arrow in the wild (client/server +
> JavaScript + GPUs + graphs).
>
> As Wes mentioned, the Arrow design needs to accommodate a wide range of
> use-cases across analytic workflows. A premium has been placed on
> simplicity and consistency, as design decisions related to on-the-wire
> representation have an outsize impact on the complexity and architecture
> of the tools and products we all build with it. To rip off Einstein,
> Arrow should be as simple as it can be, but no simpler.
>
> I can't speak for others here, but the questions I ask when evaluating a
> physical format are things like:
>
>  1. What impact do these decisions have on streaming architectures, both
>     on-device and on-the-wire?
>  2. How difficult is it to write optimized routines on massively
>     parallel (either on or off-chip) architectures?
>  3. Does this add complexity to (already complicated) byte alignment
>     requirements of non-CPU devices?
>  4. Is this straightforward to implement in multiple languages, and can
>     I support the maintenance burden necessary to fix issues as they arise?
>
> And specifically for handling nulls, the questions I'd pose are:
>
>  1. How should Arrow represent nulls in nested Struct, Map, or Union
>     types, which don't allocate an underlying data buffer of their own?
>  2. If variable-width types encode their nulls in the offsets table, how
>     does that impact thread divergence?
>  3. How should nulls be encoded in fixed-size binary columns?
>
> A validity bitmap is a consistent approach to handling nulls across the
> primitive, variable-width, and nested types. It's been my experience in
> the past that other representations may be simpler locally, but turn
> out to be either more complex or outright precluded once the rest of
> the pipeline is taken into account.
>
> Additionally, the validity bitmap doesn't preclude you from using
> sentinels informally in your own systems. The Arrow spec leaves the data
> bytes for null positions unspecified, which you could certainly fill in
> with INT_MIN for any tasks where that's most efficient or convenient;
> this is very fast with a trivial bit of OpenCL or CUDA (a common
> technique when working with GPUs, as it's often faster to scan over a
> streaming dataset up front than to incur the cost of branching or
> thread divergence later). One might argue this is abusing the format, but from
> the perspective of shipping high-quality features quickly, I've found
> Arrow's flexibility in this area and others a net plus.
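(A minimal sketch of the masked-fill Paul describes, with NumPy standing
in for the OpenCL/CUDA kernel; the array contents and the choice of
INT_MIN are purely illustrative:)

```python
import numpy as np

# The Arrow spec leaves the data bytes under null slots unspecified, so a
# producer may overwrite them with a sentinel in one vectorized pass.
INT_MIN = np.iinfo(np.int32).min

values = np.array([1, 2, 3, 4], dtype=np.int32)
validity = np.array([True, False, True, False])  # True = valid slot

# One masked store, no per-element branching -- trivially parallel on a GPU.
values[~validity] = INT_MIN
print(values.tolist())  # [1, -2147483648, 3, -2147483648]
```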

To follow up on this point, one way to address this "informal" use of
sentinels is to add metadata to schemas [1] so that you can
distinguish endogenous datasets from ones that came from some other
Arrow producer. We're pretty keen on encouraging portable algorithm
development (so a routine can run on Arrow memory from any producer),
so something like this would result in non-portable algorithms being
written, but it's up to developers to do what best suits their needs.

[1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L265

>
> Best,
> Paul
>
> On 11/9/18 3:53 PM, Wes McKinney wrote:
> > hi Matt,
> > On Fri, Nov 9, 2018 at 6:36 PM Matt Dowle <mattjdo...@gmail.com> wrote:
> >> On Fri, Nov 9, 2018 at 2:14 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>
> >>> On Fri, Nov 9, 2018 at 4:51 PM Matt Dowle <mattjdo...@gmail.com> wrote:
> >>>>> There is one database that I'm aware of that uses sentinels _and_
> >>>>> supports complex types with missing values: Kx's KDB+.
> >>>> I read this and was pleased that KDB is being used as a reference. It
> >>>> is a seriously good database: the gold-standard in many people's eyes.
> >>>>
> >>>>> This has led to some seriously strange choices like the ASCII space
> >>>>> character being used as the sentinel value for strings.
> >>>> But then I saw this. Surely if sentinels are good enough for KDB, then
> >>>> isn't that a sign that sentinels are not as bad as this group fears?
> >>> KDB has a good reputation in the financial world, but it is a very
> >>> niche product. I personally wouldn't draw any inferences about
> >>> database design from something with such a small and specialized
> >>> audience.
> >>>
> >> I find this view hard to understand. Why not draw some inference from a
> >> highly respected product?
> > To make an analogy, KDB is like a Formula 1 car. Formula 1 cars are
> > built to drive in a very particular way in a particular environment,
> > and I don't think it's representative of driving or car design in
> > general. KDB cannot be used interchangeably where PostgreSQL is used,
> > for example.
> >
> >>
> >>>> What about grouping and joining columns that contain NA?   Here's an
> >>>> example from R data.table :
> >>>>
> >>>>> DT = data.table(x=c(1,3,3,NA,1,NA), v=1:6)
> >>>>> DT
> >>>>         x     v
> >>>>     <num> <int>
> >>>> 1:     1     1
> >>>> 2:     3     2
> >>>> 3:     3     3
> >>>> 4:    NA     4
> >>>> 5:     1     5
> >>>> 6:    NA     6
> >>>>> DT[,sum(v),keyby=x]
> >>>>         x    V1
> >>>>     <num> <int>
> >>>> 1:    NA    10
> >>>> 2:     1     6
> >>>> 3:     3     5
> >>>>
> >>>> The NAs are grouped as a distinct value and are not excluded for
> >>>> statistical robustness reasons. This is very easy to achieve
> >>>> efficiently internally; in fact there is no special code to deal with
> >>>> the NA values because they are just another distinct value (the
> >>>> sentinel). In Arrow, if a bitmap is present, there would be more code
> >>>> needed to deal with the NAs (either way: including the NA group or
> >>>> excluding the NA group), if I understand correctly.
> >>> It depends on who's doing the analysis. Some database systems exclude
> >>> nulls in aggregations altogether. In others, you indeed would need to
> >>> reserve an aggregation bucket for null and use the bitmap when
> >>> determining which bucket to update for each value.
> >>>
> >> Then would you agree this is a downside of the bitmap approach? Note
> >> that whether or not you are including or excluding the NA group, there
> >> is still more code to write with the bitmap approach. Under the
> >> sentinel approach, not only is there nothing special to write, but the
> >> NA group automatically comes first because it sorts first (INT_MIN).
> >> When returning groups in first-appearance order, there is again no
> >> special code for NAs because the sentinel is just another value.
> >> Binary search in a column containing NA is again very convenient (no
> >> special code for NA) since INT_MIN comes first.
> >>
> >> In case it wasn't clear:  I would really like to use Arrow.
> >>
> > Database systems in general find sentinel values to be unacceptable
> > from a data representation point of view, as they need to be able to
> > distinguish null from any single value. We need Arrow to be a
> > technology that can be adopted widely by database systems, data
> > science tools, and other systems that process data. I think that if we
> > have to expend some extra effort in some cases in the interest of
> > creating an inclusive, widely applicable and adopted technology, that
> > is a worthwhile compromise.
> >
> >>>> On Thu, Nov 8, 2018 at 3:18 PM Phillip Cloud <cpcl...@gmail.com> wrote:
> >>>>
> >>>>> There is one database that I'm aware of that uses sentinels _and_
> >>>>> supports complex types with missing values: Kx's KDB+. This has led
> >>>>> to some seriously strange choices like the ASCII space character
> >>>>> being used as the sentinel value for strings. See
> >>>>> https://code.kx.com/wiki/Reference/Datatypes for more details.
> >>>>>
> >>>>> On Thu, Nov 8, 2018 at 4:39 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>>>>> hey Matt,
> >>>>>>
> >>>>>> Thanks for giving your perspective on the mailing list.
> >>>>>>
> >>>>>> My objective in writing about this recently
> >>>>>> (http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/, though I
> >>>>>> need to update it since the sentinel case can be done more efficiently
> >>>>>> than what's there now) was to help dispel the notion that using a
> >>>>>> separate value (bit or byte) to encode nullness is a performance
> >>>>>> compromise to comply with the requirements of database systems. I too
> >>>>>> prefer real world benchmarks to microbenchmarks, and probably null
> >>>>>> checking is not going to be the main driver of aggregate system
> >>>>>> performance. I had heard many people over the years object to bitmaps
> >>>>>> on performance grounds but without analysis to back it up.
> >>>>>>
> >>>>>> Some context for other readers on the mailing list: A language like
> >>>>>> R is not a database and has fewer built-in scalar types: int32,
> >>>>>> double, string (interned), and boolean. Out of these, int32 and
> >>>>>> double can use one bit pattern for NA (null) and not lose too much.
> >>>>>> A database system generally can't make that kind of compromise, and
> >>>>>> most popular databases can distinguish INT32_MIN (or any other
> >>>>>> value used as a sentinel) and null. If you loaded data from an Avro
> >>>>>> or Parquet file that contained one of those values, you'd have to
> >>>>>> decide what to do with the data (though I understand there are
> >>>>>> integer64 add-on packages for R now).
> >>>>>>
> >>>>>> Now back to Arrow -- we have 3 main kinds of data types:
> >>>>>>
> >>>>>> * Fixed size primitive
> >>>>>> * Variable size primitive (binary, utf8)
> >>>>>> * Nested (list, struct, union)
> >>>>>>
> >>>>>> Out of these, "fixed size primitive" is the only one that can
> >>>>>> generally support O(1) in-place mutation / updates, though all of
> >>>>>> them could support an O(1) "make null" operation (by zeroing a bit).
> >>>>>> In general, when faced with design decisions we have preferred
> >>>>>> choices benefiting use cases where datasets are treated as
> >>>>>> immutable or copy-on-write.
> >>>>>>
> >>>>>> If an application _does_ need to do mutation on primitive arrays,
> >>>>>> then you could choose to always allocate the validity bitmap so
> >>>>>> that it can be mutated without requiring allocations to happen
> >>>>>> arbitrarily in your processing workflow. But if you have data
> >>>>>> without nulls, it is a nice feature to be able to ignore the bitmap
> >>>>>> or not allocate one at all. If you constructed an array from data
> >>>>>> that you know to be non-nullable, some implementations might wish
> >>>>>> to avoid the waste of creating a bitmap with all 1's.
> >>>>>>
> >>>>>> For example, if we create an arrow::Array from a normal NumPy array
> >>>>>> of integers (which cannot have nulls), we have
> >>>>>>
> >>>>>> In [6]: import pyarrow as pa
> >>>>>> In [7]: import numpy as np
> >>>>>> In [8]: arr = pa.array(np.array([1, 2, 3, 4]))
> >>>>>>
> >>>>>> In [9]: arr.buffers()
> >>>>>> Out[9]: [None, <pyarrow.lib.Buffer at 0x7f34ecd3eea0>]
> >>>>>>
> >>>>>> In [10]: arr.null_count
> >>>>>> Out[10]: 0
> >>>>>>
> >>>>>> Normally, the first buffer would be the validity bitmap memory, but
> >>>>>> here it was not allocated because there are no nulls.
> >>>>>>
> >>>>>> Creating an open standard data representation is a difficult thing;
> >>>>>> one cannot be "all things to all people", but the intent is to be a
> >>>>>> suitable lingua franca for language-agnostic data interchange and
> >>>>>> as a runtime representation for analytical query engines (where
> >>>>>> most operators are "pure"). If the Arrow community's goal were to
> >>>>>> create a "mutable column store" then some things might be designed
> >>>>>> differently (perhaps more like the internals of
> >>>>>> https://kudu.apache.org/). It is helpful to have an understanding
> >>>>>> of what compromises have been made and how costly they are in
> >>>>>> real-world applications.
> >>>>>>
> >>>>>> best
> >>>>>> Wes
> >>>>>> On Mon, Nov 5, 2018 at 8:27 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >>>>>>> On Mon, Nov 5, 2018 at 3:43 PM Matt Dowle <mattjdo...@gmail.com> wrote:
> >>>>>>>> 1. I see. Good idea. Can we assume the bitmap is always present
> >>>>>>>> in Arrow then? I thought I'd seen Wes argue that if there were no
> >>>>>>>> NAs, the bitmap doesn't need to be allocated. Indeed I wasn't
> >>>>>>>> worried about the extra storage, although for 10,000 columns I
> >>>>>>>> wonder about the number of vectors.
> >>>>>>>>
> >>>>>>> I think different implementations handle this differently at the
> >>>>>>> moment. In the Java code, we always allocate the validity buffer
> >>>>>>> at initial allocation. We're also looking to enhance the
> >>>>>>> allocation strategy so the fixed part of values is always
> >>>>>>> allocated with validity (a single allocation) to avoid any extra
> >>>>>>> object housekeeping.
> >>>>>>>
> >>>>>>>
> >>>>>>>> 2. It's only subjective until the code complexity is measured;
> >>>>>>>> then it's not subjective. I suppose after 20 years of using
> >>>>>>>> sentinels, I'm used to it and trust it. I'll keep an open mind on
> >>>>>>>> this.
> >>>>>>>>
> >>>>>>> Yup, fair enough.
> >>>>>>>
> >>>>>>>
> >>>>>>>> 3. Since I criticized the scale of Wes' benchmark, I felt I
> >>>>>>>> should show how I do benchmarks myself, to show where I'm coming
> >>>>>>>> from. Yes, the none-null, some-null, and all-null paths offer
> >>>>>>>> savings. But that's the same under both the sentinel and bitmap
> >>>>>>>> approaches. Under both approaches, you just need to know which
> >>>>>>>> case you're in. That involves storing the number of NAs in the
> >>>>>>>> header/summary, which can be done under both approaches.
> >>>>>>>>
> >>>>>>> The item we appreciate is that you can do a single comparison
> >>>>>>> every 64 values to determine which of the three cases you are in
> >>>>>>> (making this a local decision). This means you don't have to do
> >>>>>>> housekeeping ahead of time. It also means that the window of
> >>>>>>> choice is narrow, minimizing the penalty in situations where you
> >>>>>>> have rare invalid values (or rare valid values).
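The per-64-value dispatch Jacques describes can be sketched like this (a
toy Python model with invented names; a real kernel would read the
64-bit words directly from the validity buffer):

```python
ALL_VALID = 0xFFFFFFFFFFFFFFFF  # one fully-set 64-bit word of the bitmap

def sum_with_validity(values, bitmap_words):
    """Sum `values`, where bitmap_words[w] covers values[w*64:(w+1)*64]."""
    total = 0
    for w, word in enumerate(bitmap_words):
        chunk = values[w * 64:(w + 1) * 64]
        if word == ALL_VALID:   # all valid: no per-value null checks
            total += sum(chunk)
        elif word == 0:         # all null: skip the whole window
            continue
        else:                   # mixed: fall back to per-bit tests
            total += sum(v for i, v in enumerate(chunk) if word >> i & 1)
    return total

# First 64 values valid, next 64 null: one branch decision per 64 values.
print(sum_with_validity(list(range(128)), [ALL_VALID, 0]))  # 2016
```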
