While I'm not qualified to debate the merits of various physical
representations inside databases, I would like to chime in from the
perspective of both an Arrow contributor and architect of perhaps one of
the more exotic applications of Arrow in the wild (client/server +
JavaScript + GPUs + graphs).
As Wes mentioned, the Arrow design needs to accommodate a wide range of
use-cases across analytic workflows. A premium has been placed on
simplicity and consistency, as design decisions related to on-the-wire
representation have an outsize impact on the complexity and architecture
of the tools and products we all build with it. To rip off Einstein,
Arrow should be as simple as it can be, but no simpler.
I can't speak for others here, but the questions I ask when evaluating a
physical format are things like:
1. What impact do these decisions have on streaming architectures, both
on-device and on-the-wire?
2. How difficult is it to write optimized routines on massively
parallel (either on or off-chip) architectures?
3. Does this add complexity to (already complicated) byte alignment
requirements of non-CPU devices?
4. Is this straightforward to implement in multiple languages, and can
I support the maintenance burden necessary to fix issues as they arise?
And specifically for handling nulls, the questions I'd pose are:
1. How should Arrow represent nulls in nested Struct, Map, or Union
types, which don't allocate an underlying data buffer of their own?
2. If variable-width types encode their nulls in the offsets table, how
does that impact thread divergence?
3. How should nulls be encoded in fixed-size binary columns?
A validity bitmap is a consistent approach to handling nulls across the
primitive, variable-width, and nested types. It's been my experience in
the past that other representations may be simpler locally, but become
either more complex or outright precluded once the rest of the
pipeline is taken into account.
Additionally, the validity bitmap doesn't preclude you from using
sentinels informally in your own systems. The Arrow spec leaves the data
bytes at null positions unspecified, so you could certainly fill them in
with INT_MIN for any tasks where that's most efficient or convenient;
this is very fast with a trivial bit of OpenCL or CUDA (a common
technique when working with GPUs, as it's often faster to scan over a
streaming dataset up front than to incur the cost of branching or thread
divergence later). One might argue this is abusing the format, but from
the perspective of shipping high-quality features quickly, I've found
Arrow's flexibility in this area and others a net plus.
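A CPU-side sketch of that trick in plain Python (rather than the GPU
kernel Paul describes; the LSB bit ordering matches Arrow's validity
bitmap layout, and the sentinel choice of INT_MIN is just one option):

```python
# Given an int32 column's values and its validity bitmap (Arrow LSB bit
# order: bit i of byte i // 8), overwrite the unspecified data slots at
# null positions with a sentinel so downstream kernels can stay branch-free.
INT_MIN = -2**31  # informal sentinel of our choosing

def fill_nulls_with_sentinel(values, validity_bitmap):
    out = list(values)
    for i in range(len(values)):
        bit = (validity_bitmap[i // 8] >> (i % 8)) & 1  # 1 = valid, 0 = null
        if not bit:
            out[i] = INT_MIN  # data byte content at nulls is unspecified anyway
    return out

# values [7, <null>, 9, <null>]: bitmap 0b0101 marks slots 0 and 2 valid
print(fill_nulls_with_sentinel([7, 0, 9, 0], bytes([0b0101])))
# [7, -2147483648, 9, -2147483648]
```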
Best,
Paul
On 11/9/18 3:53 PM, Wes McKinney wrote:
hi Matt,
On Fri, Nov 9, 2018 at 6:36 PM Matt Dowle <mattjdo...@gmail.com> wrote:
On Fri, Nov 9, 2018 at 2:14 PM Wes McKinney <wesmck...@gmail.com> wrote:
On Fri, Nov 9, 2018 at 4:51 PM Matt Dowle <mattjdo...@gmail.com> wrote:
There is one database that I'm aware of that uses sentinels _and_
supports complex types with missing values: Kx's KDB+.
I read this and was pleased that KDB is being used as a reference. It
is a seriously good database: the gold-standard in many people's eyes.
This has led to some seriously strange choices like the ASCII space
character being used as the sentinel value for strings.
But then I saw this. Surely if sentinels are good enough for KDB, isn't
that a sign that sentinels are not as bad as this group fears?
KDB has a good reputation in the financial world, but it is a very
niche product. I personally wouldn't draw any inferences about
database design from something with such a small and specialized
audience.
I find this view hard to understand. Why not draw some inference from a
highly respected product?
To make an analogy, KDB is like a Formula 1 car. Formula 1 cars are
built to drive in a very particular way in a particular environment,
and I don't think it's representative of driving or car design in
general. KDB cannot be used interchangeably where PostgreSQL is used,
for example.
What about grouping and joining columns that contain NA? Here's an
example from R data.table :
DT = data.table(x=c(1,3,3,NA,1,NA), v=1:6)
DT
       x     v
   <num> <int>
1:     1     1
2:     3     2
3:     3     3
4:    NA     4
5:     1     5
6:    NA     6
DT[,sum(v),keyby=x]
       x    V1
   <num> <int>
1:    NA    10
2:     1     6
3:     3     5
The NAs are grouped as a distinct value and are not excluded for
statistical robustness reasons. This is very easy to achieve efficiently
internally; in fact there is no special code to deal with the NA values
because they are just another distinct value (the sentinel). In Arrow,
if a bitmap is present, there would be more code needed to deal with the
NAs (either way: including the NA group or excluding it), if I
understand correctly.
It depends on who's doing the analysis. Some database systems exclude
nulls in aggregations altogether. In others you indeed would need to
reserve an aggregation bucket for null and use the bitmap when
determining which bucket to update for each value.
Then would you agree this is a downside of the bitmap approach? Note that
whether you include or exclude the NA group, there is still more code to
write with the bitmap approach. Under the sentinel approach, not only is
there nothing special to write, but the NA group automatically comes
first because it sorts first (INT_MIN). When returning groups in
first-appearance order, there is again no special code for NAs because the
sentinel is just another value. Binary search in a column containing NA
is again very convenient (no special code for NA) since INT_MIN comes first.
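The grouping and ordering behaviour Matt describes can be sketched in
plain Python (with INT_MIN standing in for NA, mirroring data.table's
integer sentinel; the data matches the example above):

```python
# With a sentinel encoding, NA is just another distinct value, so an
# ordinary hash-based group-by needs no NA-specific code at all.
INT_MIN = -2**31  # sentinel for NA

x = [1, 3, 3, INT_MIN, 1, INT_MIN]
v = [1, 2, 3, 4, 5, 6]

sums = {}
for key, val in zip(x, v):
    sums[key] = sums.get(key, 0) + val  # the NA group falls out for free

# Sorting the keys puts the NA group first, since INT_MIN sorts lowest,
# matching data.table's keyby= output: NA -> 10, 1 -> 6, 3 -> 5.
print(sorted(sums.items()))
```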
In case it wasn't clear: I would really like to use Arrow.
Database systems in general find sentinel values to be unacceptable
from a data representation point of view, as they need to be able to
distinguish null from any single value. We need Arrow to be a
technology that can be adopted widely by database systems, data
science tools, and other systems that process data. I think that if we
have to expend some extra effort in some cases in the interest of
creating an inclusive, widely applicable and adopted technology, that
is a worthwhile compromise.
On Thu, Nov 8, 2018 at 3:18 PM Phillip Cloud <cpcl...@gmail.com> wrote:
There is one database that I'm aware of that uses sentinels _and_
supports complex types with missing values: Kx's KDB+. This has led to
some seriously strange choices like the ASCII space character being used
as the sentinel value for strings. See
https://code.kx.com/wiki/Reference/Datatypes for more details.
On Thu, Nov 8, 2018 at 4:39 PM Wes McKinney <wesmck...@gmail.com>
wrote:
hey Matt,
Thanks for giving your perspective on the mailing list.
My objective in writing about this recently
(http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/, though I
need to update since the sentinel case can be done more efficiently
than what's there now) was to help dispel the notion that using a
separate value (bit or byte) to encode nullness is a performance
compromise to comply with the requirements of database systems. I too
prefer real world benchmarks to microbenchmarks, and probably null
checking is not going to be the main driver of aggregate system
performance. I had heard many people over the years object to bitmaps
on performance grounds but without analysis to back it up.
Some context for other readers on the mailing list: a language like R
is not a database and has fewer built-in scalar types: int32, double,
string (interned), and boolean. Out of these, int32 and double can use
one bit pattern for NA (null) and not lose too much. A database system
generally can't make that kind of compromise, and most popular
databases can distinguish INT32_MIN (or any other value used as a
sentinel) from null. If you loaded data from an Avro or Parquet file
that contained one of those values, you'd have to decide what to do
with the data (though I understand there are integer64 add-on packages
for R now).
Now back to Arrow -- we have 3 main kinds of data types:
* Fixed size primitive
* Variable size primitive (binary, utf8)
* Nested (list, struct, union)
Out of these, "fixed size primitive" is the only one that can
generally support O(1) in-place mutation / updates, though all of them
could support an O(1) "make null" operation (by zeroing a bit). In
general, when faced with design decisions we have preferred choices
benefiting use cases where datasets are treated as immutable or
copy-on-write.
If an application _does_ need to do mutation on primitive arrays, then
you could choose to always allocate the validity bitmap so that it can
be mutated without requiring allocations to happen arbitrarily in your
processing workflow. But if you have data without nulls, it is a nice
feature to be able to ignore the bitmap or not allocate one at all. If
you constructed an array from data that you know to be non-nullable,
some implementations might wish to avoid the waste of creating a
bitmap with all 1's.
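That O(1) "make null" operation amounts to clearing one bit in the
validity buffer; a minimal sketch in plain Python (using Arrow's LSB
bit ordering, and leaving the data bytes untouched):

```python
# Nulling out element i only requires clearing bit i of the validity
# buffer; the value bytes can be left as-is (their content is unspecified
# for null slots).
def make_null(validity, i):
    validity[i // 8] &= ~(1 << (i % 8))

validity = bytearray([0b1111])  # four valid values
make_null(validity, 2)          # constant-time, no reallocation
print(bin(validity[0]))         # 0b1011: slot 2 is now null
```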
For example, if we create an arrow::Array from a normal NumPy array of
integers (which cannot have nulls), we have
In [6]: import pyarrow as pa
In [7]: import numpy as np
In [8]: arr = pa.array(np.array([1, 2, 3, 4]))
In [9]: arr.buffers()
Out[9]: [None, <pyarrow.lib.Buffer at 0x7f34ecd3eea0>]
In [10]: arr.null_count
Out[10]: 0
Normally, the first buffer would be the validity bitmap memory, but
here it was not allocated because there are no nulls.
Creating an open standard data representation is a difficult thing;
one cannot be "all things to all people", but the intent is to be a
suitable lingua franca for language-agnostic data interchange and a
runtime representation for analytical query engines (where most
operators are "pure"). If the Arrow community's goal were to create a
"mutable column store" then some things might be designed differently
(perhaps more like the internals of https://kudu.apache.org/). It is
helpful to have an understanding of what compromises have been made
and how costly they are in real world applications.
best
Wes
On Mon, Nov 5, 2018 at 8:27 PM Jacques Nadeau <jacq...@apache.org>
wrote:
On Mon, Nov 5, 2018 at 3:43 PM Matt Dowle <mattjdo...@gmail.com>
wrote:
1. I see. Good idea. Can we assume the bitmap is always present in
Arrow then? I thought I'd seen Wes argue that if there were no NAs, the
bitmap doesn't need to be allocated. Indeed I wasn't worried about the
extra storage, although for 10,000 columns I wonder about the number of
vectors.
I think different implementations handle this differently at the moment.
In the Java code, we allocate the validity buffer at initial allocation
always. We're also looking to enhance the allocation strategy so the
fixed part of values is always allocated with validity (single
allocation) to avoid any extra object housekeeping.
2. It's only subjective until the code complexity is measured, then it's
not subjective. I suppose after 20 years of using sentinels, I'm used to
it and trust it. I'll keep an open mind on this.
Yup, fair enough.
3. Since I criticized the scale of Wes' benchmark, I felt I should show
how I do benchmarks myself, to show where I'm coming from. Yes,
none-null, some-null, and all-null paths offer savings. But that's the
same under both sentinel and bitmap approaches. Under both approaches,
you just need to know which case you're in. That involves storing the
number of NAs in the header/summary, which can be done under both
approaches.
The item we appreciate is that you can do a single comparison every 64
values to determine which of the three cases you are in (making this a
local decision). This means you don't have to do housekeeping ahead of
time. It also means that the window of choice is narrow, minimizing the
penalty in situations where you have rare invalid values (or rare valid
values).
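The word-at-a-time check Jacques describes can be sketched in plain
Python (assuming, for simplicity, a bitmap whose length is a multiple of
64 bits; real implementations would handle the ragged tail):

```python
# Classify each 64-value window of a validity bitmap with a single
# comparison: all-valid and all-null windows take fast paths, and only
# mixed windows need per-bit inspection.
import struct

ALL_VALID = (1 << 64) - 1

def classify_windows(bitmap: bytes):
    out = []
    for (word,) in struct.iter_unpack("<Q", bitmap):  # 64 bits at a time
        if word == ALL_VALID:
            out.append("all-valid")  # no null checks needed in this window
        elif word == 0:
            out.append("all-null")   # skip the window entirely
        else:
            out.append("mixed")      # fall back to per-bit handling
    return out

# 64 valid bits, then 64 null bits, then a mixed window:
bitmap = bytes([0xFF] * 8 + [0x00] * 8 + [0x0F] + [0xFF] * 7)
print(classify_windows(bitmap))  # ['all-valid', 'all-null', 'mixed']
```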