A few random thoughts...

It seems like you outline three things as possible concerns:

1) Late allocation failure
2) Code complexity and risk of errors
3) Performance

For 1: is the validity bitmap really too expensive to allocate up front as
part of the initial allocation? In our experience, it is nominal relative
to the size of the rest of the data. Is your data different from this? If
you assume an initial allocation rather than a lazy one, your concern
about late failure goes away. (If you want to pre-initialize to all-set
instead of not-set by default, that seems like a minor memory-clearing
change.)
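
To put a rough number on "nominal": the validity bitmap is one bit per
value, so for an int64 column it is 1/64th (about 1.6%) of the value
buffer. Here is a minimal sketch in C of allocating both buffers together
(an illustrative layout and names of my own, not Arrow's actual API):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical column layout. Values and validity are allocated
       together, up front, so there is no late allocation to fail on
       the first null assignment. */
    typedef struct {
        int64_t *values;
        uint8_t *validity;  /* 1 bit per value; 1 = valid */
        size_t length;
    } Int64Column;

    int column_init(Int64Column *col, size_t n) {
        size_t validity_bytes = (n + 7) / 8;  /* n = 1M: 125 KB vs 8 MB
                                                 of values, ~1.6% extra */
        col->values = malloc(n * sizeof(int64_t));
        col->validity = malloc(validity_bytes);
        if (col->values == NULL || col->validity == NULL) {
            free(col->values);
            free(col->validity);
            return -1;  /* the single allocation-failure point, at creation */
        }
        memset(col->validity, 0xFF, validity_bytes);  /* all-set: "valid" */
        col->length = n;
        return 0;
    }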

For 2: I think this is very subjective. Having code that checks for a
particular sentinel and then has special handling for it seems much more
complex/error-prone to me than keeping validity information completely
independent of value storage. In effect, sentinel values couple two
completely independent concerns: whether a value exists and what it is.
This becomes even more complex in situations where the sentinel can
change depending on the range of possible data.
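
To make the "two independent concerns" point concrete, compare the two
checks side by side (a sketch; the INT32_MIN/INT8_MIN sentinels mirror
R's convention and are not anything Arrow-specific):

    #include <stdint.h>
    #include <stddef.h>

    /* Sentinel approach: null-ness is entangled with the value domain,
       the test differs by type, and the sentinel stops being a legal
       data value. */
    static inline int is_null_i32(int32_t x) { return x == INT32_MIN; }
    static inline int is_null_i8(int8_t x)   { return x == INT8_MIN; }

    /* Validity-bitmap approach: one rule for every type, fully
       independent of what the values happen to be. */
    static inline int is_valid(const uint8_t *validity, size_t i) {
        return (validity[i / 8] >> (i % 8)) & 1;
    }

The sentinel test is a different function per type and removes a value
from the legal domain; the validity test is the same for every type and
never touches the value buffer.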

For 3: I'm not sure the db-benchmark is helpful for this discussion since
it isn't granular enough to clearly isolate the impact of this particular
design choice. I can assure you that the ability to eliminate a large
number of branches and operations in a null-if-null complex expression
tree pattern yielded substantial performance improvements in Gandiva.
Even in Dremio, we've specialized for sub-segments of data where you have
three custom paths: none-null, some-null, all-null. Being able to do that
without having to read each value individually can be the largest win of
all, especially since many people declare all data as nullable but rarely
use that nullability.
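
As a sketch of what "eliminating branches" looks like for a null-if-null
binary operation (illustrative C of my own, not Gandiva's generated
kernels): the values can be computed unconditionally, the output validity
is a word-at-a-time AND of the input bitmaps, and a known-zero null count
on both inputs gives you the all-valid fast path for free. The null-count
parameters stand in for the counts an Arrow array already carries.

    #include <stdint.h>
    #include <stddef.h>

    /* out = a + b with null-if-null semantics. Null slots may hold
       garbage; the validity bitmap is authoritative, so the value loop
       needs no per-element null branch. */
    void add_i64(const int64_t *a, const uint64_t *a_valid,
                 const int64_t *b, const uint64_t *b_valid,
                 int64_t *out, uint64_t *out_valid, size_t n,
                 int64_t a_null_count, int64_t b_null_count) {
        size_t words = (n + 63) / 64;
        for (size_t i = 0; i < n; i++)
            out[i] = a[i] + b[i];             /* branch-free value pass */
        if (a_null_count == 0 && b_null_count == 0) {
            for (size_t w = 0; w < words; w++)
                out_valid[w] = ~(uint64_t)0;  /* all-valid fast path */
        } else {
            for (size_t w = 0; w < words; w++)
                out_valid[w] = a_valid[w] & b_valid[w];  /* 64 nulls/op */
        }
    }

With a sentinel you would instead need a per-element test and select,
e.g. out[i] = (a[i] == INT_MIN || b[i] == INT_MIN) ? INT_MIN : a[i] + b[i],
which is exactly the per-value branching this layout avoids.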

On Mon, Nov 5, 2018 at 2:31 PM Matt Dowle <mattjdo...@gmail.com> wrote:

> Hi,
> (First post to this mailing list.)
> I tweeted here and Wes invited me to follow up on this list:
> https://twitter.com/wesmckinn/status/1059440916987961346
>
> Wes - it was great to meet you at Stanford in September. There I mentioned
> the assign/update aspect which is a downside of bitmap for NA, imho. I did
> see your recent article but I didn't see the update aspect mentioned there.
> It's rarely about speed for me and I'm uncomfortable when the longest
> runtime presented is just 0.04s.
>
> The concern surrounds updating (UPDATE in SQL, sub-assign in Python/R).  If
> there are no NAs in the column, the first assignment of an NA has to
> allocate the bitmap vector and link that properly to the column. This
> allocation could fail (albeit rarely) due to out-of-memory so code has to
> exist to deal with that properly, from multi-threaded code too. I mentioned
> cache lines in the tweet not to focus on speed, but to focus on the
> complexity of updating the several places correctly from multi-threaded
> code: updating the NA count, the bitmap value and the column itself is 3
> assigns all to be correctly wrapped in one critical section?
>
> In contrast, the sentinel approach (INT_MIN) is one assign to one place.
> That's simpler. There's less scope for a corruption since it isn't possible
> for a junk value to be wrongly included due to the bitmap values not being
> set properly, somehow. Simpler means less code. When we're talking about
> multi-threaded update to memory-mapped shared data, small simplifications
> like this (sentinel) can help a lot for robustness, safety and correctness
> while keeping the internal code required to achieve that to a minimum.
>
> Here is an example benchmark which I consider more relevant to make design
> decisions based on: https://h2oai.github.io/db-benchmark/.  More relevant
> because the scale is in minutes and not-working. It's this not-working
> aspect that has often been in my mind when making data.table more memory
> efficient, for example. Not really raw sub-second speed per se.  Bitmap for
> NA has a higher chance of resulting in instability (crashes or incorrect
> results) than sentinel simply due to there being more parts (three things
> to keep in sync when updating, including a possible allocate): more to go
> wrong at the low level.
>
> No?  I'm interested to hear further and I would like to use Arrow. I really
> would. But this is a major sticking point.
>
> Best, Matt
>
