> I can tell you from working in a large advertising company, that it is
critical to have a reliable way to distinguish null from not null, and a
space had better be not null.
What did a single space mean that was so important not to be null?
Symbols are recommended in KDB rather than strings. In the symbol type,
null, "" and " " are distinct iirc. Strings are more for free format data
fields (fields you wouldn't normally group, filter or join on), like
varchar in SQL.
In R strings are interned and NA is different to "" and " ". So just
because KDB chose " " to be sentinel in the string type, and lets say that
is a bad choice for the sake of argument, that isn't an argument against
sentinel per se. For example, in Arrow, there is presumably an offset table
to know how long each string is (otherwise all strings would have to be
scanned for final \0, which would be slow). A value in the offset table
could represent NA rather than either the string contents itself or a whole
separate bitmap.

On Fri, Nov 9, 2018 at 2:33 PM Phillip Cloud <cpcl...@gmail.com> wrote:

> The point I was trying to argue is that making the choice of a value from a
> set of values to use as a sentinel has consequences that are worth thinking
> about. One consequence of choosing such a value is a (hypothetical) isspace
> and an isnull (called null in KDB+) function would give indistinguishable
> results. When using KDB+ I found it disturbing that "remove all null values
> from a column" is synonymous with "remove all strings whose value is a
> single space". You'd actually have to convert your string column to
> integers and do the comparison to the value 32 (assuming you're in ASCII or
> a subset of it, I'm not sure if other encodings represent spaces
> differently) to remove spaces. Even then, in KDB+, you'd also be removing
> nulls. This suggests to me that someone made an assumption at some point in
> KDB+ that spaces were rare enough that giving them a completely different
> meaning was "okay". I can tell you from working in a large advertising
> company, that it is critical to have a reliable way to distinguish null
> from not null, and a space had better be not null. It's possible that my
> knowledge about kdb+ is out of date, in which case I'd be happy to know
> that this behavior no longer exists in KDB+.
>
> On Fri, Nov 9, 2018 at 4:51 PM Matt Dowle <mattjdo...@gmail.com> wrote:
>
> > > There is one database that I'm aware of that uses sentinels _and_
> > supports complex types with missing values: Kx's KDB+.
> > I read this and was pleased that KDB is being used as a reference.  It
> is a
> > seriously good database: the gold-standard in many people's eyes.
> >
> > > This has led to some seriously strange choices like the ASCII space
> > character being used as the sentinel value for strings.
> > But then I saw this. Surely if sentinels are good enough for KDB then
> isn't
> > that a sign that sentinels are not as bad as this group fears?
> >
> > What about grouping and joining columns that contain NA?   Here's an
> > example from R data.table :
> >
> > > DT = data.table(x=c(1,3,3,NA,1,NA), v=1:6)
> > > DT
> >        x     v
> >    <num> <int>
> > 1:     1     1
> > 2:     3     2
> > 3:     3     3
> > 4:    NA     4
> > 5:     1     5
> > 6:    NA     6
> > > DT[,sum(v),keyby=x]
> >        x    V1
> >    <num> <int>
> > 1:    NA    10
> > 2:     1     6
> > 3:     3     5
> >
> > The NAs are grouped as a distinct value and are not excluded for
> > statistical robustness reasons.  This is very easy to achieve efficiently
> > internally; in fact there is no special code to deal with the NA values
> > because they are just another distinct value (the sentinel).  In Arrow
> if a
> > bitmap is present, there would be more code needed to deal with the NAs
> > (either way: including the NA group or excluding the NA group), if I
> > understand correctly.
> >
> > On Thu, Nov 8, 2018 at 3:18 PM Phillip Cloud <cpcl...@gmail.com> wrote:
> >
> > > There is one database that I'm aware of that uses sentinels _and_
> > supports
> > > complex types with missing values: Kx's KDB+. This has led to some
> > > seriously strange choices like the ASCII space character being used as
> > the
> > > sentinel value for strings. See
> > > https://code.kx.com/wiki/Reference/Datatypes for
> > > more details.
> > >
> > > On Thu, Nov 8, 2018 at 4:39 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> > >
> > > > hey Matt,
> > > >
> > > > Thanks for giving your perspective on the mailing list.
> > > >
> > > > My objective in writing about this recently
> > > > (http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/, though I
> > > > need to update since the sentinel case can be done more efficiently
> > > > than what's there now) was to help dispel the notion that using a
> > > > separate value (bit or byte) to encode nullness is a performance
> > > > compromise to comply with the requirements of database systems. I too
> > > > prefer real world benchmarks to microbenchmarks, and probably null
> > > > checking is not going to be the main driver of aggregate system
> > > > performance. I had heard many people over the years object to bitmaps
> > > > on performance grounds but without analysis to back it up.
> > > >
> > > > Some context for other readers on the mailing list: A language like R
> > > > is not a database and has fewer built-in scalar types: int32, double,
> > > > string (interned), and boolean. Out of these, int32 and double can
> use
> > > > one bit pattern for NA (null) and not lose too much. A database
> system
> > > > generally can't make that kind of compromise, and most popular
> > > > databases can distinguish INT32_MIN (or any other value used as a
> > > > sentinel) and null. If you loaded data from an Avro or Parquet file
> > > > that contained one of those values, you'd have to decide what to do
> > > > with the data (though I understand there's integer64 add-on packages
> > > > for R now)
> > > >
> > > > Now back to Arrow -- we have 3 main kinds of data types:
> > > >
> > > > * Fixed size primitive
> > > > * Variable size primitive (binary, utf8)
> > > > * Nested (list, struct, union)
> > > >
> > > > Out of these, "fixed size primitive" is the only one that can
> > > > generally support O(1) in-place mutation / updates, though all of
> them
> > > > could support a O(1) "make null" operation (by zeroing a bit). In
> > > > general, when faced with designs we have preferred choices benefiting
> > > > use cases where datasets are treated as immutable or copy-on-write.
> > > >
> > > > If an application _does_ need to do mutation on primitive arrays,
> then
> > > > you could choose to always allocate the validity bitmap so that it
> can
> > > > be mutated without requiring allocations to happen arbitrarily in
> your
> > > > processing workflow. But, if you have data without nulls, it is a
> nice
> > > > feature to be able to ignore the bitmap or not allocate one at all.
> If
> > > > you constructed an array from data that you know to be non-nullable,
> > > > some implementations might wish to avoid the waste of creating a
> > > > bitmap with all 1's.
> > > >
> > > > For example, if we create an array::Array from a normal NumPy array
> of
> > > > integers (which cannot have nulls), we have
> > > >
> > > > In [6]: import pyarrow as pa
> > > > In [7]: import numpy as np
> > > > In [8]: arr = pa.array(np.array([1, 2, 3, 4]))
> > > >
> > > > In [9]: arr.buffers()
> > > > Out[9]: [None, <pyarrow.lib.Buffer at 0x7f34ecd3eea0>]
> > > >
> > > > In [10]: arr.null_count
> > > > Out[10]: 0
> > > >
> > > > Normally, the first buffer would be the validity bitmap memory, but
> > > > here it was not allocated because there are no nulls.
> > > >
> > > > Creating an open standard data representation is a difficult thing;
> > > > one cannot be "all things to all people" but the intent is to be a
> > > > suitable lingua franca for language agnostic data interchange and as
> a
> > > > runtime representation for analytical query engines (where most
> > > > operators are "pure"). If the Arrow community's goal were to create a
> > > > "mutable column store" then some things might be designed differently
> > > > (perhaps more like internals of https://kudu.apache.org/). It is
> > > > helpful to have an understanding of what compromises have been made
> > > > and how costly they are in real world applications.
> > > >
> > > > best
> > > > Wes
> > > > On Mon, Nov 5, 2018 at 8:27 PM Jacques Nadeau <jacq...@apache.org>
> > > wrote:
> > > > >
> > > > > On Mon, Nov 5, 2018 at 3:43 PM Matt Dowle <mattjdo...@gmail.com>
> > > wrote:
> > > > >
> > > > > > 1. I see. Good idea. Can we assume bitmap is always present in
> > Arrow
> > > > then?
> > > > > > I thought I'd seen Wes argue that if there were no NAs, the
> bitmap
> > > > doesn't
> > > > > > need to be allocated.  Indeed I wasn't worried about the extra
> > > storage,
> > > > > > although for 10,000 columns I wonder about the number of vectors.
> > > > > >
> > > > >
> > > > > I think different implementations handle this differently at the
> > > moment.
> > > > In
> > > > > the Java code, we allocate the validity buffer at initial
> allocation
> > > > > always. We're also looking to enhance the allocation strategy so
> the
> > > > fixed
> > > > > part of values are always allocated with validity (single
> allocation)
> > > to
> > > > > avoid any extra object housekeeping.
> > > > >
> > > > >
> > > > > > 2. It's only subjective until the code complexity is measured,
> then
> > > > it's
> > > > > > not subjective. I suppose after 20 years of using sentinels, I'm
> > used
> > > > to it
> > > > > > and trust it. I'll keep an open mind on this.
> > > > > >
> > > > > Yup, fair enough.
> > > > >
> > > > >
> > > > > > 3. Since I criticized the scale of Wes' benchmark, I felt I
> should
> > > > show how
> > > > > > I do benchmarks myself to show where I'm coming from. Yes
> > none-null,
> > > > > > some-null and all-null paths offer savings. But that's the same
> > under
> > > > both
> > > > > > sentinel and bitmap approaches. Under both approaches, you just
> > need
> > > to
> > > > > > know which case you're in. That involves storing the number of
> NAs
> > in
> > > > the
> > > > > > header/summary which can be done under both approaches.
> > > > > >
> > > > >
> > > > > The item we appreciate is that you can do a single comparison every
> > 64
> > > > > values to determine which of the three cases you are in (make this
> a
> > > > local
> > > > > decision). This means you don't have to do housekeeping ahead of
> > time.
> > > It
> > > > > also means that the window of choice is narrow, minimizing the
> > penalty
> > > in
> > > > > situations where you have rare invalid values (or rare valid
> values).
> > > >
> > >
> >
>

Reply via email to