Re: Assign/update : NA bitmap vs sentinel

Wes McKinney Fri, 09 Nov 2018 15:54:54 -0800

hi Matt,
On Fri, Nov 9, 2018 at 6:36 PM Matt Dowle <[email protected]> wrote:
>
> On Fri, Nov 9, 2018 at 2:14 PM Wes McKinney <[email protected]> wrote:
>
> > On Fri, Nov 9, 2018 at 4:51 PM Matt Dowle <[email protected]> wrote:
> > >
> > > > There is one database that I'm aware of that uses sentinels _and_
> > > supports complex types with missing values: Kx's KDB+.
> > > I read this and was pleased that KDB is being used as a reference.  It
> > is a
> > > seriously good database: the gold-standard in many people's eyes.
> > >
> > > > This has led to some seriously strange choices like the ASCII space
> > > character being used as the sentinel value for strings.
> > > But then I saw this. Surely if sentinels are good enough for KDB then
> > isn't
> > > that a sign that sentinels are not as bad as this group fears?
> >
> > KDB has a good reputation in the financial world, but it is a very
> > niche product. I personally wouldn't draw any inferences about
> > database design from something with such a small and specialized
> > audience.
> >
>
> I find this view hard to understand. Why not draw some inference from a
> highly respected product.


To make an analogy, KDB is like a Formula 1 car. Formula 1 cars are
built to drive in a very particular way in a particular environment,
and I don't think it's representative of driving or car design in
general. KDB cannot be used interchangeably where PostgreSQL is used,
for example.

>
>
> >
> > >
> > > What about grouping and joining columns that contain NA?   Here's an
> > > example from R data.table :
> > >
> > > > DT = data.table(x=c(1,3,3,NA,1,NA), v=1:6)
> > > > DT
> > >        x     v
> > >    <num> <int>
> > > 1:     1     1
> > > 2:     3     2
> > > 3:     3     3
> > > 4:    NA     4
> > > 5:     1     5
> > > 6:    NA     6
> > > > DT[,sum(v),keyby=x]
> > >        x    V1
> > >    <num> <int>
> > > 1:    NA    10
> > > 2:     1     6
> > > 3:     3     5
> > >
> > > The NAs are grouped as a distinct value and are not excluded for
> > > statistical robustness reasons.  This is very easy to achieve efficiently
> > > internally; in fact there is no special code to deal with the NA values
> > > because they are just another distinct value (the sentinel).  In Arrow
> > if a
> > > bitmap is present, there would be more code needed to deal with the NAs
> > > (either way: including the NA group or excluding the NA group), if I
> > > understand correctly.
> >
> > It depends on who's doing the analysis. Some database systems exclude
> > nulls in aggregations altogether. In others you indeed would need to
> > reserve an aggregation bucket to null and use the bitmap when
> > determining which bucket to update for each value
> >
>
> Then would you agree this is a downside of the bitmap approach?  Note that
> whether or not you are including or excluding the NA group, there is still
> more code to write with the bitmap approach. Under the sentinel approach
> not only is there nothing special to write,  but the NA group automatically
> comes first because it sorts first (INT_MIN).  When returning groups in
> first-appearance order, there is again no special code for NAs because the
> sentinel is just another value.   Binary search in a column containing NA
> is again very convenient (no special code for NA) since INT_MIN comes first.
>
> In case it wasn't clear:  I would really like to use Arrow.
>

Database systems in general find sentinel values to be unacceptable
from a data representation point of view, as they need to be able to
distinguish null from any single value. We need Arrow to be a
technology that can be adopted widely by database systems, data
science tools, and other systems that process data. I think that if we
have to expend some extra effort in some cases in the interest of
creating an inclusive, widely applicable and adopted technology, that
is a worthwhile compromise.

>
> > >
> > > On Thu, Nov 8, 2018 at 3:18 PM Phillip Cloud <[email protected]> wrote:
> > >
> > > > There is one database that I'm aware of that uses sentinels _and_
> > supports
> > > > complex types with missing values: Kx's KDB+. This has led to some
> > > > seriously strange choices like the ASCII space character being used as
> > the
> > > > sentinel value for strings. See
> > > > https://code.kx.com/wiki/Reference/Datatypes for
> > > > more details.
> > > >
> > > > On Thu, Nov 8, 2018 at 4:39 PM Wes McKinney <[email protected]>
> > wrote:
> > > >
> > > > > hey Matt,
> > > > >
> > > > > Thanks for giving your perspective on the mailing list.
> > > > >
> > > > > My objective in writing about this recently
> > > > > (http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/, though I
> > > > > need to update since the sentinel case can be done more efficiently
> > > > > than what's there now) was to help dispel the notion that using a
> > > > > separate value (bit or byte) to encode nullness is a performance
> > > > > compromise to comply with the requirements of database systems. I too
> > > > > prefer real world benchmarks to microbenchmarks, and probably null
> > > > > checking is not going to be the main driver of aggregate system
> > > > > performance. I had heard many people over the years object to bitmaps
> > > > > on performance grounds but without analysis to back it up.
> > > > >
> > > > > Some context for other readers on the mailing list: A language like R
> > > > > is not a database and has fewer built-in scalar types: int32, double,
> > > > > string (interned), and boolean. Out of these, int32 and double can
> > use
> > > > > one bit pattern for NA (null) and not lose too much. A database
> > system
> > > > > generally can't make that kind of compromise, and most popular
> > > > > databases can distinguish INT32_MIN (or any other value used as a
> > > > > sentinel) and null. If you loaded data from an Avro or Parquet file
> > > > > that contained one of those values, you'd have to decide what to do
> > > > > with the data (though I understand there's integer64 add-on packages
> > > > > for R now)
> > > > >
> > > > > Now back to Arrow -- we have 3 main kinds of data types:
> > > > >
> > > > > * Fixed size primitive
> > > > > * Variable size primitive (binary, utf8)
> > > > > * Nested (list, struct, union)
> > > > >
> > > > > Out of these, "fixed size primitive" is the only one that can
> > > > > generally support O(1) in-place mutation / updates, though all of
> > them
> > > > > could support a O(1) "make null" operation (by zeroing a bit). In
> > > > > general, when faced with designs we have preferred choices benefiting
> > > > > use cases where datasets are treated as immutable or copy-on-write.
> > > > >
> > > > > If an application _does_ need to do mutation on primitive arrays,
> > then
> > > > > you could choose to always allocate the validity bitmap so that it
> > can
> > > > > be mutated without requiring allocations to happen arbitrarily in
> > your
> > > > > processing workflow. But, if you have data without nulls, it is a
> > nice
> > > > > feature to be able to ignore the bitmap or not allocate one at all.
> > If
> > > > > you constructed an array from data that you know to be non-nullable,
> > > > > some implementations might wish to avoid the waste of creating a
> > > > > bitmap with all 1's.
> > > > >
> > > > > For example, if we create an array::Array from a normal NumPy array
> > of
> > > > > integers (which cannot have nulls), we have
> > > > >
> > > > > In [6]: import pyarrow as pa
> > > > > In [7]: import numpy as np
> > > > > In [8]: arr = pa.array(np.array([1, 2, 3, 4]))
> > > > >
> > > > > In [9]: arr.buffers()
> > > > > Out[9]: [None, <pyarrow.lib.Buffer at 0x7f34ecd3eea0>]
> > > > >
> > > > > In [10]: arr.null_count
> > > > > Out[10]: 0
> > > > >
> > > > > Normally, the first buffer would be the validity bitmap memory, but
> > > > > here it was not allocated because there are no nulls.
> > > > >
> > > > > Creating an open standard data representation is a difficult thing;
> > > > > one cannot be "all things to all people" but the intent is to be a
> > > > > suitable lingua franca for language agnostic data interchange and as
> > a
> > > > > runtime representation for analytical query engines (where most
> > > > > operators are "pure"). If the Arrow community's goal were to create a
> > > > > "mutable column store" then some things might be designed differently
> > > > > (perhaps more like internals of https://kudu.apache.org/). It is
> > > > > helpful to have an understanding of what compromises have been made
> > > > > and how costly they are in real world applications.
> > > > >
> > > > > best
> > > > > Wes
> > > > > On Mon, Nov 5, 2018 at 8:27 PM Jacques Nadeau <[email protected]>
> > > > wrote:
> > > > > >
> > > > > > On Mon, Nov 5, 2018 at 3:43 PM Matt Dowle <[email protected]>
> > > > wrote:
> > > > > >
> > > > > > > 1. I see. Good idea. Can we assume bitmap is always present in
> > Arrow
> > > > > then?
> > > > > > > I thought I'd seen Wes argue that if there were no NAs, the
> > bitmap
> > > > > doesn't
> > > > > > > need to be allocated.  Indeed I wasn't worried about the extra
> > > > storage,
> > > > > > > although for 10,000 columns I wonder about the number of vectors.
> > > > > > >
> > > > > >
> > > > > > I think different implementations handle this differently at the
> > > > moment.
> > > > > In
> > > > > > the Java code, we allocate the validity buffer at initial
> > allocation
> > > > > > always. We're also looking to enhance the allocation strategy so
> > the
> > > > > fixed
> > > > > > part of values are always allocated with validity (single
> > allocation)
> > > > to
> > > > > > avoid any extra object housekeeping.
> > > > > >
> > > > > >
> > > > > > > 2. It's only subjective until the code complexity is measured,
> > then
> > > > > it's
> > > > > > > not subjective. I suppose after 20 years of using sentinels, I'm
> > used
> > > > > to it
> > > > > > > and trust it. I'll keep an open mind on this.
> > > > > > >
> > > > > > Yup, fair enough.
> > > > > >
> > > > > >
> > > > > > > 3. Since I criticized the scale of Wes' benchmark, I felt I
> > should
> > > > > show how
> > > > > > > I do benchmarks myself to show where I'm coming from. Yes
> > none-null,
> > > > > > > some-null and all-null paths offer savings. But that's the same
> > under
> > > > > both
> > > > > > > sentinel and bitmap approaches. Under both approaches, you just
> > need
> > > > to
> > > > > > > know which case you're in. That involves storing the number of
> > NAs in
> > > > > the
> > > > > > > header/summary which can be done under both approaches.
> > > > > > >
> > > > > >
> > > > > > The item we appreciate is that you can do a single comparison
> > every 64
> > > > > > values to determine which of the three cases you are in (make this
> > a
> > > > > local
> > > > > > decision). This means you don't have to do housekeeping ahead of
> > time.
> > > > It
> > > > > > also means that the window of choice is narrow, minimizing the
> > penalty
> > > > in
> > > > > > situations where you have rare invalid values (or rare valid
> > values).
> > > > >
> > > >
> >

Re: Assign/update : NA bitmap vs sentinel

Reply via email to