Hey Matt,

Thanks for giving your perspective on the mailing list.

My objective in writing about this recently
(http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/, though I
need to update it, since the sentinel case can be done more
efficiently than what's there now) was to help dispel the notion that
using a separate value (bit or byte) to encode nullness is a
performance compromise to comply with the requirements of database
systems. I too prefer real-world benchmarks to microbenchmarks, and
null checking is probably not going to be the main driver of aggregate
system performance. Over the years I had heard many people object to
bitmaps on performance grounds, but without analysis to back it up.
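
To make the comparison concrete, here is a minimal sketch of the two
encodings in plain NumPy (the names and layout here are mine, for
illustration only, not Arrow APIs):

import numpy as np

# Sentinel encoding (R-style): one bit pattern in the value domain means NA
INT64_NA = np.iinfo(np.int64).min

values = np.array([1, INT64_NA, 3, 4], dtype=np.int64)

# Bitmap encoding: validity stored out-of-band; bit i set => value i is valid
validity = np.array([0b1101], dtype=np.uint8)

# Null check with a sentinel: compare every value against the magic value
nulls_sentinel = int((values == INT64_NA).sum())

# Null check with a bitmap: popcount the bitmap, never touching the values
bits = np.unpackbits(validity, bitorder="little")[:len(values)]
nulls_bitmap = len(values) - int(bits.sum())

assert nulls_sentinel == nulls_bitmap == 1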

Some context for other readers on the mailing list: A language like R
is not a database and has fewer built-in scalar types: int32, double,
string (interned), and boolean. Out of these, int32 and double can use
one bit pattern for NA (null) and not lose too much. A database system
generally can't make that kind of compromise, and most popular
databases can distinguish INT32_MIN (or any other value used as a
sentinel) from null. If you loaded data from an Avro or Parquet file
that contained one of those values, you'd have to decide what to do
with the data (though I understand there are integer64 add-on packages
for R now).
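
For example, in pyarrow (a quick sketch) INT32_MIN and null stay
distinct, because nullness is tracked out-of-band:

import pyarrow as pa

# INT32_MIN is an ordinary value in Arrow; null is tracked separately
arr = pa.array([-2147483648, None], type=pa.int32())

arr[0].as_py()   # -2147483648, preserved as data
arr.null_count   # 1; only the explicit None is null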

Now back to Arrow -- we have 3 main kinds of data types:

* Fixed size primitive
* Variable size primitive (binary, utf8)
* Nested (list, struct, union)

Out of these, "fixed size primitive" is the only one that can
generally support O(1) in-place mutation / updates, though all of them
could support a O(1) "make null" operation (by zeroing a bit). In
general, when faced with designs we have preferred choices benefiting
use cases where datasets are treated as immutable or copy-on-write.
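
To illustrate in pyarrow (the make_null helper below is my sketch, not
an Arrow API):

import pyarrow as pa

# Fixed size primitive: [validity bitmap, values buffer]
pa.array([1, 2, None]).buffers()

# Variable size primitive: [validity bitmap, int32 offsets, character data]
pa.array(["a", None, "bc"]).buffers()

# Nested: the list's validity and offsets, then the child array's buffers
pa.array([[1, 2], None]).buffers()

# The O(1) "make null", applied to a mutable copy of a validity bitmap
def make_null(validity: bytearray, i: int) -> None:
    validity[i // 8] &= 0xFF ^ (1 << (i % 8))  # clear bit i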

If an application _does_ need to do mutation on primitive arrays, then
you could choose to always allocate the validity bitmap so that it can
be mutated without requiring allocations to happen arbitrarily in your
processing workflow. But if you have data without nulls, it is a nice
feature to be able to ignore the bitmap or not allocate one at all. If
you constructed an array from data that you know to be non-nullable,
some implementations might wish to avoid the waste of creating a
bitmap with all 1's.

For example, if we create an arrow::Array from a normal NumPy array of
integers (which cannot have nulls), we have:

In [6]: import pyarrow as pa
In [7]: import numpy as np
In [8]: arr = pa.array(np.array([1, 2, 3, 4]))

In [9]: arr.buffers()
Out[9]: [None, <pyarrow.lib.Buffer at 0x7f34ecd3eea0>]

In [10]: arr.null_count
Out[10]: 0

Normally, the first buffer would be the validity bitmap memory, but
here it was not allocated because there are no nulls.
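
By contrast, if the input does contain nulls, the validity bitmap gets
allocated (a sketch of the same session; exact buffer addresses will
vary):

In [11]: arr2 = pa.array([1, None, 3, 4])

In [12]: arr2.buffers()
Out[12]: [<pyarrow.lib.Buffer at 0x7f34ecd3ef00>, <pyarrow.lib.Buffer at 0x7f34ecd3ef60>]

In [13]: arr2.null_count
Out[13]: 1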

Creating an open standard data representation is difficult; one cannot
be "all things to all people", but the intent is for Arrow to be a
suitable lingua franca for language-agnostic data interchange and a
runtime representation for analytical query engines (where most
operators are "pure"). If the Arrow community's goal were to create a
"mutable column store", then some things might be designed differently
(perhaps more like the internals of https://kudu.apache.org/). It is
helpful to have an understanding of what compromises have been made
and how costly they are in real-world applications.

best
Wes
On Mon, Nov 5, 2018 at 8:27 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
> On Mon, Nov 5, 2018 at 3:43 PM Matt Dowle <mattjdo...@gmail.com> wrote:
>
> > 1. I see. Good idea. Can we assume bitmap is always present in Arrow then?
> > I thought I'd seen Wes argue that if there were no NAs, the bitmap doesn't
> > need to be allocated.  Indeed I wasn't worried about the extra storage,
> > although for 10,000 columns I wonder about the number of vectors.
> >
>
> I think different implementations handle this differently at the moment. In
> the Java code, we allocate the validity buffer at initial allocation
> always. We're also looking to enhance the allocation strategy so the fixed
> part of values is always allocated with validity (single allocation) to
> avoid any extra object housekeeping.
>
>
> > 2. It's only subjective until the code complexity is measured, then it's
> > not subjective. I suppose after 20 years of using sentinels, I'm used to it
> > and trust it. I'll keep an open mind on this.
> >
> Yup, fair enough.
>
>
> > 3. Since I criticized the scale of Wes' benchmark, I felt I should show how
> > I do benchmarks myself to show where I'm coming from. Yes, non-null,
> > some-null and all-null paths offer savings. But that's the same under both
> > sentinel and bitmap approaches. Under both approaches, you just need to
> > know which case you're in. That involves storing the number of NAs in the
> > header/summary which can be done under both approaches.
> >
>
> The item we appreciate is that you can do a single comparison every 64
> values to determine which of the three cases you are in (make this a local
> decision). This means you don't have to do housekeeping ahead of time. It
> also means that the window of choice is narrow, minimizing the penalty in
> situations where you have rare invalid values (or rare valid values).
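
A minimal sketch of that word-at-a-time dispatch (Python for clarity;
it assumes, as Arrow requires, that the bitmap is padded so that full
8-byte words can be read):

import struct

def count_valid(validity: bytes, length: int) -> int:
    # one comparison per 64 values picks the all-valid / all-null / mixed path
    count = 0
    full_words = length // 64
    for w in range(full_words):
        (word,) = struct.unpack_from("<Q", validity, w * 8)
        if word == 0xFFFFFFFFFFFFFFFF:
            count += 64                    # all valid: no per-value checks
        elif word != 0:
            count += bin(word).count("1")  # mixed: fall back to popcount
        # word == 0: all null, skip the 64 values entirely
    for i in range(full_words * 64, length):  # remaining tail bits
        count += validity[i // 8] >> (i % 8) & 1
    return count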
