Hi,

(First post to this mailing list.)

I tweeted here and Wes invited me to follow up on this list:
https://twitter.com/wesmckinn/status/1059440916987961346

Wes - it was great to meet you at Stanford in September. There I mentioned the assign/update aspect, which is a downside of a bitmap for NA, imho. I did see your recent article but I didn't see the update aspect mentioned there. It's rarely about speed for me, and I'm uncomfortable when the longest runtime presented is just 0.04s.

The concern surrounds updating (UPDATE in SQL, sub-assign in Python/R). If there are no NAs in the column, the first assignment of an NA has to allocate the bitmap vector and link it properly to the column. This allocation could fail (albeit rarely) due to out-of-memory, so code has to exist to deal with that properly, from multi-threaded code too. I mentioned cache lines in the tweet not to focus on speed, but to focus on the complexity of updating several places correctly from multi-threaded code: updating the NA count, the bitmap value and the column itself is 3 assigns, all to be correctly wrapped in one critical section?

In contrast, the sentinel approach (INT_MIN) is one assign to one place. That's simpler. There's less scope for corruption, since it isn't possible for a junk value to be wrongly included because the bitmap values were somehow not set properly. Simpler means less code. When we're talking about multi-threaded update to memory-mapped shared data, small simplifications like this (sentinel) can help a lot for robustness, safety and correctness while keeping the internal code required to achieve that to a minimum.

Here is an example benchmark which I consider more relevant for making design decisions:
https://h2oai.github.io/db-benchmark/
It is more relevant because the scale is in minutes and not-working. It's this not-working aspect that has often been in my mind when making data.table more memory efficient, for example. Not really raw sub-second speed per se.

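To make the "three assigns versus one" comparison above concrete, here is a minimal Python sketch. It is an illustration only, not how Arrow or data.table are implemented: the class names SentinelColumn and BitmapColumn and the assign/is_na methods are hypothetical, and a threading.Lock stands in for whatever critical section a real implementation would use. The bitmap uses 1 = valid with least-significant-bit numbering, as in Arrow's validity bitmaps.

```python
import threading
from array import array

INT_MIN = -2**31  # sentinel for NA in a 32-bit integer column


class SentinelColumn:
    """Hypothetical column where NA is stored in-band as INT_MIN."""

    def __init__(self, values):
        self.values = array("i", values)  # 'i' is a 4-byte int on common platforms

    def assign(self, i, value):
        # One store to one place; None means NA.
        # The cost: INT_MIN itself can no longer be stored as a real value.
        self.values[i] = INT_MIN if value is None else value

    def is_na(self, i):
        return self.values[i] == INT_MIN


class BitmapColumn:
    """Hypothetical column with a lazily allocated validity bitmap."""

    def __init__(self, values):
        self.values = array("i", values)
        self.validity = None  # not allocated while the column has no NAs
        self.null_count = 0
        self.lock = threading.Lock()  # stand-in for the critical section

    def assign(self, i, value):
        byte, bit = divmod(i, 8)
        mask = 1 << bit
        with self.lock:  # values, bitmap and count must change together
            if value is None:
                if self.validity is None:
                    # First NA: allocate the bitmap (this can raise MemoryError).
                    nbytes = (len(self.values) + 7) // 8
                    self.validity = bytearray(b"\xff" * nbytes)
                if self.validity[byte] & mask:
                    self.validity[byte] &= ~mask & 0xFF  # clear bit: now NA
                    self.null_count += 1
            else:
                self.values[i] = value
                if self.validity is not None and not (self.validity[byte] & mask):
                    self.validity[byte] |= mask  # set bit: valid again
                    self.null_count -= 1

    def is_na(self, i):
        if self.validity is None:
            return False
        byte, bit = divmod(i, 8)
        return ((self.validity[byte] >> bit) & 1) == 0
```

In this sketch the sentinel update touches only the values array, while the bitmap update has an allocation path plus up to three pieces of state (the value, the validity bit, the null count) that must stay consistent under the lock. That is the extra surface area I mean.
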
Bitmap for NA has a higher chance of resulting in instability (crashes or incorrect results) than sentinel simply due to there being more parts (three things to keep in sync when updating, including a possible allocate): more to go wrong at the low level. No?

I'm interested to hear further and I would like to use Arrow. I really would. But this is a major sticking point.

Best,
Matt