While I'm not qualified to debate the merits of various physical
representations inside databases, I would like to chime in from the
perspective of both an Arrow contributor and architect of perhaps one of
the more exotic applications of Arrow in the wild (client/server +
JavaScript + GPUs + graphs).
As Wes mentioned, the Arrow design needs to accommodate a wide range of
use-cases across analytic workflows. A premium has been placed on
simplicity and consistency, as design decisions related to on-the-wire
representation have an outsize impact on the complexity and architecture
of the tools and products we all build with it. To rip off Einstein,
Arrow should be as simple as it can be, but no simpler.
I can't speak for others here, but the questions I ask when evaluating a
physical format are things like:
1. What impact do these decisions have on streaming architectures, both
on-device and on-the-wire?
2. How difficult is it to write optimized routines on massively
parallel (either on or off-chip) architectures?
3. Does this add complexity to (already complicated) byte alignment
requirements of non-CPU devices?
4. Is this straightforward to implement in multiple languages, and can
I support the maintenance burden necessary to fix issues as they arise?
And specifically for handling nulls, the questions I'd pose are:
1. How should Arrow represent nulls in nested Struct, Map, or Union
types, which don't allocate an underlying data buffer of their own?
2. If variable-width types encode their nulls in the offsets table, how
does that impact thread divergence?
3. How should nulls be encoded in fixed-size binary columns?
A validity bitmap is a consistent approach to handling nulls across the
primitive, variable-width, and nested types. It's been my experience in
the past that other representations may be simpler locally, but become
either more complex or outright precluded once the rest of the
pipeline is taken into account.
Additionally, the validity bitmap doesn't preclude you from using
sentinels informally in your own systems. The Arrow spec leaves the data
bytes at null positions unspecified, so you could certainly fill them in
with INT_MIN for any tasks where that's most efficient or convenient;
this is very fast with a trivial bit of OpenCL or CUDA (a common
technique when working with GPUs, as it's often faster to scan over a
streaming dataset up front than to incur the cost of branching or thread
divergence later). One might argue this is abusing the format, but from
the perspective of shipping high-quality features quickly, I've found
Arrow's flexibility in this area and others a net plus.
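A CPU-side sketch of that trick in plain Python (rather than the GPU
kernel Paul describes; the LSB bit ordering matches Arrow's validity
bitmap layout, and the sentinel choice of INT_MIN is just one option):

```python
# Given an int32 column's values and its validity bitmap (Arrow LSB bit
# order: bit i of byte i // 8), overwrite the unspecified data slots at
# null positions with a sentinel so downstream kernels can stay branch-free.
INT_MIN = -2**31  # informal sentinel of our choosing

def fill_nulls_with_sentinel(values, validity_bitmap):
    out = list(values)
    for i in range(len(values)):
        bit = (validity_bitmap[i // 8] >> (i % 8)) & 1  # 1 = valid, 0 = null
        if not bit:
            out[i] = INT_MIN  # data byte content at nulls is unspecified anyway
    return out

# values [7, <null>, 9, <null>]: bitmap 0b0101 marks slots 0 and 2 valid
print(fill_nulls_with_sentinel([7, 0, 9, 0], bytes([0b0101])))
# [7, -2147483648, 9, -2147483648]
```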
Best,
Paul
On 11/9/18 3:53 PM, Wes McKinney wrote:
hi Matt,
On Fri, Nov 9, 2018 at 6:36 PM Matt Dowle <mattjdo...@gmail.com> wrote:
On Fri, Nov 9, 2018 at 2:14 PM Wes McKinney <wesmck...@gmail.com> wrote:
On Fri, Nov 9, 2018 at 4:51 PM Matt Dowle <mattjdo...@gmail.com> wrote:
There is one database that I'm aware of that uses sentinels _and_
supports complex types with missing values: Kx's KDB+.
I read this and was pleased that KDB is being used as a reference. It
is a seriously good database: the gold-standard in many people's eyes.
This has led to some seriously strange choices like the ASCII space
character being used as the sentinel value for strings.
But then I saw this. Surely if sentinels are good enough for KDB, isn't
that a sign that sentinels are not as bad as this group fears?
KDB has a good reputation in the financial world, but it is a very
niche product. I personally wouldn't draw any inferences about
database design from something with such a small and specialized
audience.
I find this view hard to understand. Why not draw some inference from a
highly respected product?
To make an analogy, KDB is like a Formula 1 car. Formula 1 cars are
built to drive in a very particular way in a particular environment,
and I don't think it's representative of driving or car design in
general. KDB cannot be used interchangeably where PostgreSQL is used,
for example.
What about grouping and joining columns that contain NA? Here's an
example from R data.table :
DT = data.table(x=c(1,3,3,NA,1,NA), v=1:6)
DT
       x     v
   <num> <int>
1:     1     1
2:     3     2
3:     3     3
4:    NA     4
5:     1     5
6:    NA     6
DT[,sum(v),keyby=x]
       x    V1
   <num> <int>
1:    NA    10
2:     1     6
3:     3     5
The NAs are grouped as a distinct value and are not excluded for
statistical robustness reasons. This is very easy to achieve efficiently
internally; in fact there is no special code to deal with the NA values
because they are just another distinct value (the sentinel). In Arrow,
if a bitmap is present, there would be more code needed to deal with the
NAs (either way: including the NA group or excluding it), if I
understand correctly.
It depends on who's doing the analysis. Some database systems exclude
nulls in aggregations altogether. In others you indeed would need to
reserve an aggregation bucket for null and use the bitmap when
determining which bucket to update for each value.
Then would you agree this is a downside of the bitmap approach? Note that
whether you include or exclude the NA group, there is still more code to
write with the bitmap approach. Under the sentinel approach, not only is
there nothing special to write, but the NA group automatically comes
first because it sorts first (INT_MIN). When returning groups in
first-appearance order, there is again no special code for NAs because the
sentinel is just another value. Binary search in a column containing NA
is again very convenient (no special code for NA) since INT_MIN comes first.
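The grouping and ordering behaviour Matt describes can be sketched in
plain Python (with INT_MIN standing in for NA, mirroring data.table's
integer sentinel; the data matches the example above):

```python
# With a sentinel encoding, NA is just another distinct value, so an
# ordinary hash-based group-by needs no NA-specific code at all.
INT_MIN = -2**31  # sentinel for NA

x = [1, 3, 3, INT_MIN, 1, INT_MIN]
v = [1, 2, 3, 4, 5, 6]

sums = {}
for key, val in zip(x, v):
    sums[key] = sums.get(key, 0) + val  # the NA group falls out for free

# Sorting the keys puts the NA group first, since INT_MIN sorts lowest,
# matching data.table's keyby= output: NA -> 10, 1 -> 6, 3 -> 5.
print(sorted(sums.items()))
```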
In case it wasn't clear: I would really like to use Arrow.
Database systems in general find sentinel values to be unacceptable
from a data representation point of view, as they need to be able to
distinguish null from any single value. We need Arrow to be a
technology that can be adopted widely by database systems, data
science tools, and other systems that process data. I think that if we
have to expend some extra effort in some cases in the interest of
creating an inclusive, widely applicable and adopted technology, that
is a worthwhile compromise.
On Thu, Nov 8, 2018 at 3:18 PM Phillip Cloud <cpcl...@gmail.com> wrote:
There is one database that I'm aware of that uses sentinels _and_
supports complex types with missing values: Kx's KDB+. This has led to
some seriously strange choices like the ASCII space character being used
as the sentinel value for strings. See
https://code.kx.com/wiki/Reference/Datatypes for more details.
On Thu, Nov 8, 2018 at 4:39 PM Wes McKinney <wesmck...@gmail.com>
wrote:
hey Matt,
Thanks for giving your perspective on the mailing list.
My objective in writing about this recently
(http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/, though I
need to update since the sentinel case can be done more efficiently
than what's there now) was to help dispel the notion that using a
separate value (bit or byte) to encode nullness is a performance
compromise to comply with the requirements of database systems. I too
prefer real world benchmarks to microbenchmarks, and probably null
checking is not going to be the main driver of aggregate system
performance. I had heard many people over the years object to bitmaps
on performance grounds but without analysis to back it up.
Some context for other readers on the mailing list: a language like R
is not a database and has fewer built-in scalar types: int32, double,
string (interned), and boolean. Out of these, int32 and double can use
one bit pattern for NA (null) and not lose too much. A database system
generally can't make that kind of compromise, and most popular
databases can distinguish INT32_MIN (or any other value used as a
sentinel) from null. If you loaded data from an Avro or Parquet file
that contained one of those values, you'd have to decide what to do
with the data (though I understand there are integer64 add-on packages
for R now).
Now back to Arrow -- we have 3 main kinds of data types:
* Fixed size primitive
* Variable size primitive (binary, utf8)
* Nested (list, struct, union)
Out of these, "fixed size primitive" is the only one that can
generally support O(1) in-place mutation / updates, though all of them
could support an O(1) "make null" operation (by zeroing a bit). In
general, when faced with design decisions we have preferred choices
benefiting use cases where datasets are treated as immutable or
copy-on-write.
If an application _does_ need to do mutation on primitive arrays, then
you could choose to always allocate the validity bitmap so that it can
be mutated without requiring allocations to happen arbitrarily in your
processing workflow. But if you have data without nulls, it is a nice
feature to be able to ignore the bitmap or not allocate one at all. If
you constructed an array from data that you know to be non-nullable,
some implementations might wish to avoid the waste of creating a
bitmap with all 1's.
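That O(1) "make null" operation amounts to clearing one bit in the
validity buffer; a minimal sketch in plain Python (using Arrow's LSB
bit ordering, and leaving the data bytes untouched):

```python
# Nulling out element i only requires clearing bit i of the validity
# buffer; the value bytes can be left as-is (their content is unspecified
# for null slots).
def make_null(validity, i):
    validity[i // 8] &= ~(1 << (i % 8))

validity = bytearray([0b1111])  # four valid values
make_null(validity, 2)          # constant-time, no reallocation
print(bin(validity[0]))         # 0b1011: slot 2 is now null
```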
For example, if we create an arrow::Array from a normal NumPy array of
integers (which cannot have nulls), we have
In [6]: import pyarrow as pa
In [7]: import numpy as np
In [8]: arr = pa.array(np.array([1, 2, 3, 4]))
In [9]: arr.buffers()
Out[9]: [None, <pyarrow.lib.Buffer at 0x7f34ecd3eea0>]
In [10]: arr.null_count
Out[10]: 0
Normally, the first buffer would be the validity bitmap memory, but
here it was not allocated because there are no nulls.
Creating an open standard data representation is a difficult thing;
one cannot be "all things to all people", but the intent is to be a
suitable lingua franca for language-agnostic data interchange and a
runtime representation for analytical query engines (where most
operators are "pure"). If the Arrow community's goal were to create a
"mutable column store" then some things might be designed differently
(perhaps more like the internals of https://kudu.apache.org/). It is
helpful to have an understanding of what compromises have been made
and how costly they are in real world applications.
best
Wes
On Mon, Nov 5, 2018 at 8:27 PM Jacques Nadeau <jacq...@apache.org>
wrote:
On Mon, Nov 5, 2018 at 3:43 PM Matt Dowle <mattjdo...@gmail.com>
wrote:
1. I see. Good idea. Can we assume the bitmap is always present in
Arrow then? I thought I'd seen Wes argue that if there were no NAs, the
bitmap doesn't need to be allocated. Indeed I wasn't worried about the
extra storage, although for 10,000 columns I wonder about the number of
vectors.
I think different implementations handle this differently at the moment.
In the Java code, we allocate the validity buffer at initial allocation
always. We're also looking to enhance the allocation strategy so the
fixed part of values is always allocated with validity (single
allocation) to avoid any extra object housekeeping.
2. It's only subjective until the code complexity is measured, then it's
not subjective. I suppose after 20 years of using sentinels, I'm used to
it and trust it. I'll keep an open mind on this.
Yup, fair enough.
3. Since I criticized the scale of Wes' benchmark, I felt I should show
how I do benchmarks myself, to show where I'm coming from. Yes,
none-null, some-null, and all-null paths offer savings. But that's the
same under both sentinel and bitmap approaches. Under both approaches,
you just need to know which case you're in. That involves storing the
number of NAs in the header/summary, which can be done under both
approaches.
The item we appreciate is that you can do a single comparison every 64
values to determine which of the three cases you are in (making this a
local decision). This means you don't have to do housekeeping ahead of
time. It also means that the window of choice is narrow, minimizing the
penalty in situations where you have rare invalid values (or rare valid
values).
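The word-at-a-time check Jacques describes can be sketched in plain
Python (assuming, for simplicity, a bitmap whose length is a multiple of
64 bits; real implementations would handle the ragged tail):

```python
# Classify each 64-value window of a validity bitmap with a single
# comparison: all-valid and all-null windows take fast paths, and only
# mixed windows need per-bit inspection.
import struct

ALL_VALID = (1 << 64) - 1

def classify_windows(bitmap: bytes):
    out = []
    for (word,) in struct.iter_unpack("<Q", bitmap):  # 64 bits at a time
        if word == ALL_VALID:
            out.append("all-valid")  # no null checks needed in this window
        elif word == 0:
            out.append("all-null")   # skip the window entirely
        else:
            out.append("mixed")      # fall back to per-bit handling
    return out

# 64 valid bits, then 64 null bits, then a mixed window:
bitmap = bytes([0xFF] * 8 + [0x00] * 8 + [0x0F] + [0xFF] * 7)
print(classify_windows(bitmap))  # ['all-valid', 'all-null', 'mixed']
```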