Some inline thoughts:

On Sat, Feb 20, 2016 at 5:16 PM, Julian Hyde <jh...@apache.org> wrote:
> Have you considered representing nulls using a placeholder value? For values
> with 4 or more bits, there is at least one value that doesn’t occur in a
> given batch. That value could be used as the null placeholder. Then you don’t
> need a null vector.
>
Using any kind of null sentinel / placeholder is tricky. Using a constant
sentinel is not desirable, as it limits your ability to represent the full
range of values within primitive C types (also, what sentinel do you use for
variable-length / list types?). Having a dynamic sentinel is interesting, but
consider:

- How do you know what value to use in incremental builder classes?
- If two arrays of the same type use different sentinels, concatenation may
  require non-trivial computation beyond a memcpy
- Bitmasks allow you to use hardware popcount to determine whether a range of
  values contains any nulls at all, which is much faster than value-by-value
  comparisons

I haven't done a microbenchmark to understand the performance implications of
primitive value comparison versus bit checking; that would be useful
information to have.
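To illustrate that last point, a minimal sketch of a bitmap range check (not
Arrow code, just the idea; it assumes GCC/Clang's __builtin_popcountll and a
length that is a multiple of 64). One popcount answers the question for 64
values at once, where a sentinel scheme would need 64 value comparisons:

#include <cstddef>
#include <cstdint>

// Returns true if any of the first `length` values is null. Convention
// assumed here: bit i == 1 in the validity bitmap means value i is non-null.
bool HasNulls(const uint64_t* validity_bitmap, std::size_t length) {
  for (std::size_t word = 0; word < length / 64; ++word) {
    // A single hardware popcount checks 64 values at a time.
    if (__builtin_popcountll(validity_bitmap[word]) != 64) {
      return true;
    }
  }
  return false;
}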
> For small values such as byte and short, it’s possible that the value space
> is filled up, in which case you could move to the next size up.
>

In general, many applications may not find this palatable, as data already
stored in a particular integer type (e.g. int8) would have to be upcast if it
is deemed not to "fit" (and determining this could itself be computationally
intensive).

> Representing nulls this way would save allocating an extra piece of memory,
> using up another cache line, and, I imagine, could quite often reduce the
> number of branch instructions used: you could check ‘v[i] < 10’ rather than
> ‘v[i] < 10 and !isnull[i]’.
>
> Or maybe this kind of null compression (and other kinds of compression like
> dictionary encoding and using non-multiples of 8 bits) belongs at the
> application layer?
>

Perhaps so -- several people have asked me about dictionary encoding in the
context of Arrow. In that particular case, a dictionary-encoded array is
really two arrays -- an integer index array plus the dictionary array --
interpreted as an array of the logical type of the dictionary. That seems to
me like a problem of the metadata or application domain. Of course, we can
provide fast algorithms within Arrow for dictionary encoding and other such
standard operations.
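To make the two-array picture concrete, a minimal sketch (the struct and
names are illustrative, not an Arrow API; null handling is omitted):

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// A dictionary-encoded string array is two arrays -- an integer index
// array plus the dictionary array -- read together as logical strings.
struct DictionaryEncodedStringArray {
  std::vector<int32_t> indices;         // one entry per logical value
  std::vector<std::string> dictionary;  // the distinct values

  // The logical value at position i is the dictionary entry that the
  // i-th index points at.
  const std::string& Value(std::size_t i) const {
    return dictionary[indices[i]];
  }
};

// Example: the logical array ["foo", "bar", "foo", "foo"] encodes as
//   indices    = {0, 1, 0, 0}
//   dictionary = {"foo", "bar"}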
> Julian
>
>
>> On Feb 20, 2016, at 12:41 PM, Wes McKinney <w...@cloudera.com> wrote:
>>
>> Since the data is immutable, not having to ever recompute the null
>> count has a lot of benefits -- I find that sum(isnull(x)) is a
>> commonly examined statistic. I'm expecting we'll soon have
>> hardware-accelerated popcount available in the implementations, so
>> that even the null-aware code paths can cheaply skip swaths of
>> all-null data.
>>
>> On Sat, Feb 20, 2016 at 12:31 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>>> I think part of it comes from the headache of all the types, etc. You get
>>> into some sort of permutation fatigue :)
>>>
>>> bool v. integer:
>>>
>>> Interesting question. I haven't thought a lot about it, but it seems like
>>> the cardinality of nulls could be a useful metric for choosing among
>>> algorithms. Not sure that it is more expensive to maintain.
>>>
>>> On Sat, Feb 20, 2016 at 12:12 PM, Daniel Robinson <danrobinson...@gmail.com>
>>> wrote:
>>>
>>>> I yield to your judgment from experience, although I am surprised it
>>>> wouldn't simplify the code in this case, since you would use the same
>>>> algorithms on (and, if you never allocate a bitmask, as Wes suggested,
>>>> use the same data structures for) "nullable arrays" with null_count 0 as
>>>> for non-nullable arrays. And formally, I think it makes sense to have a
>>>> nullable Array<T> have one of two types, some_nulls Array<T> or no_nulls
>>>> Array<T>, much as its values have either type T or null.
>>>>
>>>> Is there a reason to make null_count an integer? Or could it just be a
>>>> bool that keeps track of whether there are nulls or not?
>>>>
>>>>
>>>> On Sat, Feb 20, 2016 at 3:03 PM, Jacques Nadeau <jacq...@apache.org>
>>>> wrote:
>>>>
>>>>> Makes sense
>>>>>
>>>>> On Sat, Feb 20, 2016 at 11:56 AM, Wes McKinney <w...@cloudera.com> wrote:
>>>>>
>>>>>> My expectation would be that data without nulls (as with required
>>>>>> types) would typically not have the null bitmap allocated at all, but
>>>>>> this would be implementation dependent. For example, in builder
>>>>>> classes, the null bitmap could be allocated the first time a null is
>>>>>> appended.
>>>>>>
>>>>>> In an IPC / wire protocol context, there would be no reason to send
>>>>>> extra bits when the null count is 0 -- the data receiver, based on
>>>>>> their implementation, could decide whether or not to allocate a bitmap
>>>>>> based on that information. Since the data structures are intended to be
>>>>>> immutable, there is no specific need to create an all-0 bitmap.
>>>>>>
>>>>>> On Sat, Feb 20, 2016 at 11:52 AM, Jacques Nadeau <jacq...@apache.org>
>>>>>> wrote:
>>>>>>> We actually started there (and in fact Drill has lived there for the
>>>>>>> last three years). However, more and more, other members of that team
>>>>>>> and I have come to the conclusion that the additional complexity isn't
>>>>>>> worth the extra level of code complication. By providing the null
>>>>>>> count we can achieve the same level of efficiency (+/- carrying around
>>>>>>> an extra bitmap, which is pretty nominal in the grand scheme of
>>>>>>> things).
>>>>>>>
>>>>>>> Another thought could be exposing nullability as a physical property
>>>>>>> and not having it be part of the logical model. That being said, I
>>>>>>> don't think it is worth the headache.
>>>>>>>
>>>>>>> On Sat, Feb 20, 2016 at 11:43 AM, Daniel Robinson <
>>>>>>> danrobinson...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I like this proposal (as well as the rest of the spec so far!). But
>>>>>>>> why not go further and just store arrays that are nullable according
>>>>>>>> to the schema but have no nulls in them as "non-nullable" data
>>>>>>>> structures—i.e. structures that have no null bitmask? (After all, it
>>>>>>>> would obviously be a waste to allocate a null bitmask for arrays with
>>>>>>>> null_count = 0.) There would then be two types at the data structure
>>>>>>>> level, and two implementations of every algorithm, one for each of
>>>>>>>> those types.
>>>>>>>>
>>>>>>>> If you do that, I'm not sure I see a reason for keeping track of
>>>>>>>> null_count. Is there ever an efficiency gain from having that stored
>>>>>>>> with an array? Algorithms that might introduce or remove nulls could
>>>>>>>> just keep track of their own "null_count" that increments up from 0,
>>>>>>>> and create a no-nulls data structure if they never find one.
>>>>>>>>
>>>>>>>> I think this might also simplify the system interchange validation
>>>>>>>> problem, since a system could just check the data-structure-level
>>>>>>>> type of the input. (Although I'm not sure I understand why that would
>>>>>>>> be necessary at "runtime.")
>>>>>>>>
>>>>>>>> Perhaps you should have different names for the data-structure-level
>>>>>>>> types to distinguish them from the "nullable" and "non-nullable"
>>>>>>>> types at the schema level. (And also for philosophical reasons—since
>>>>>>>> the arrays are immutable, "nullable" doesn't really have meaning on
>>>>>>>> that level, does it?) "some_null" and "no_null"? Maybe "sparse" and
>>>>>>>> "dense," although that too has a different meaning elsewhere in the
>>>>>>>> spec...
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Feb 20, 2016 at 12:39 PM, Wes McKinney <w...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> hi folks,
>>>>>>>>>
>>>>>>>>> welcome to all! It's great to see so many people excited about our
>>>>>>>>> plans to make data systems faster and more interoperable.
>>>>>>>>>
>>>>>>>>> In thinking about building some initial Arrow integrations, I've
>>>>>>>>> run into a couple of interrelated format questions.
>>>>>>>>>
>>>>>>>>> The first is a proposal to add a null count to Arrow arrays. With
>>>>>>>>> optional/nullable data, null_count == 0 will allow algorithms to
>>>>>>>>> skip the null-handling code paths and treat the data as
>>>>>>>>> required/non-nullable, yielding performance benefits. For example:
>>>>>>>>>
>>>>>>>>> if (arr->null_count() == 0) {
>>>>>>>>>   ...
>>>>>>>>> } else {
>>>>>>>>>   ...
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Relatedly, at the data structure level there is little semantic
>>>>>>>>> distinction between these two cases:
>>>>>>>>>
>>>>>>>>> - Required / non-nullable arrays
>>>>>>>>> - Optional arrays with null count 0
>>>>>>>>>
>>>>>>>>> My thought is that "required-ness" would best be handled at the
>>>>>>>>> metadata / schema level, rather than tasking the lowest tier of
>>>>>>>>> data structures and algorithms with handling two semantically
>>>>>>>>> distinct, but functionally equivalent, forms of data without nulls.
>>>>>>>>> When performing analytics, it adds complexity, as some operations
>>>>>>>>> may introduce or remove nulls, which would require type metadata to
>>>>>>>>> be massaged, i.e.:
>>>>>>>>>
>>>>>>>>> function(required input) -> optional output, versus
>>>>>>>>>
>>>>>>>>> function(input [null_count == 0]) -> output [maybe null_count > 0]
>>>>>>>>>
>>>>>>>>> In the latter case, algorithms set bits and track the number of
>>>>>>>>> nulls while constructing the output Arrow array; the former adds
>>>>>>>>> some extra complexity.
>>>>>>>>>
>>>>>>>>> The question, of course, is where to enforce "required" in data
>>>>>>>>> interchange. If two systems have agreed (through exchange of
>>>>>>>>> schemas/metadata) that a particular batch of Arrow data is
>>>>>>>>> non-nullable, I would suggest that the null_count == 0 contract be
>>>>>>>>> validated at that point.
>>>>>>>>>
>>>>>>>>> Curious to hear others' thoughts on this, and please let me know if
>>>>>>>>> I can clarify anything I've said here.
>>>>>>>>>
>>>>>>>>> best,
>>>>>>>>> Wes
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>
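To illustrate the null_count fast path proposed above, a minimal sketch. The
array layout and names here are hypothetical (not the actual Arrow
structures), and the length is assumed to be a multiple of 64; it shows both
the branch on null_count == 0 and how the null-aware path can still skip
whole 64-value words that are entirely null:

#include <cstddef>
#include <cstdint>

// Hypothetical array layout, for illustration only.
struct Int32Array {
  const int32_t* values;
  const uint64_t* validity_bitmap;  // 1 bit per value; bit set == non-null
  std::size_t length;               // assumed a multiple of 64 here
  std::size_t null_count;
};

int64_t Sum(const Int32Array& arr) {
  int64_t sum = 0;
  if (arr.null_count == 0) {
    // Fast path: the null-handling code is skipped entirely.
    for (std::size_t i = 0; i < arr.length; ++i) {
      sum += arr.values[i];
    }
  } else {
    // Null-aware path: whole bitmap words that are all-null are skipped
    // without examining the 64 values they cover.
    for (std::size_t word = 0; word < arr.length / 64; ++word) {
      const uint64_t bits = arr.validity_bitmap[word];
      if (bits == 0) continue;  // 64 consecutive nulls
      for (std::size_t bit = 0; bit < 64; ++bit) {
        if (bits & (UINT64_C(1) << bit)) {
          sum += arr.values[word * 64 + bit];
        }
      }
    }
  }
  return sum;
}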