Generally I've found that this isn't an important optimization in the use
cases we see. Memory overhead, especially with our Java shared allocation
scheme, is nominal. Optimizing null checks at the word level is usually much
more impactful, since runs of null and non-null values within a short window
are much more common than columns that are declared non-nullable.

In other words, I'd suggest you look at your problem with a broader
perspective and see whether you're actually focused on optimizing the most
important dimension.

As an aside, the original Arrow Java code actually treated these concepts
more distinctly and we consciously made a decision to collapse them to
simplify real world use.

I do think it makes sense to add a dirty read interface if you want. This
would allow consumers of the interface to behave efficiently if they wanted to.

One last note: efficient evaluation and processing should generally always
work at the validity word level. Adding an extra if check per word, versus
the extra complexity of having an early out per batch, seems like a pretty
small lift in the grand scheme of processing.
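
To make the word-level point concrete, here is a minimal sketch of what I
mean. It is not Arrow code: the class name WordLevelSum and the use of plain
long[] arrays in place of Arrow buffers are assumptions for illustration
only. Bits are numbered least-significant first, as in the Arrow columnar
format.

```java
// Hypothetical illustration, not part of the Arrow Java API.
public final class WordLevelSum {
    private WordLevelSum() {}

    /**
     * Sums the non-null entries of values. Bit (i % 64) of validity[i / 64]
     * is 1 when values[i] is non-null.
     */
    public static long sum(long[] values, long[] validity, int count) {
        long total = 0;
        for (int base = 0; base < count; base += 64) {
            int len = Math.min(64, count - base);
            // Ignore bits past the end of the vector in the last word.
            long mask = (len == 64) ? -1L : (1L << len) - 1;
            long word = validity[base >>> 6] & mask;
            if (word == mask) {
                // Every value in this window is non-null: no per-value check.
                for (int i = 0; i < len; i++) {
                    total += values[base + i];
                }
            } else if (word != 0) {
                // Mixed window: visit only the set bits.
                while (word != 0) {
                    total += values[base + Long.numberOfTrailingZeros(word)];
                    word &= word - 1; // clear the lowest set bit
                }
            }
            // word == 0: the whole window is null, so skip it.
        }
        return total;
    }
}
```

A vector with no nulls takes the first branch for every word, so the cost
over a dedicated non-nullable path is one comparison per 64 values.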

On Wed, Mar 11, 2020, 9:15 AM Brian Hulette <hulet...@gmail.com> wrote:

> > And there is a "nullable" metadata-only flag at the
> > Field level. Could the same kinds of optimizations be implemented in
> > Java without introducing a "nullable" concept?
>
> Note Liya Fan did suggest pulling the nullable flag from the Field when the
> vector is created in item (1) of the proposed changes.
>
> Brian
>
> On Wed, Mar 11, 2020 at 5:54 AM Fan Liya <liya.fa...@gmail.com> wrote:
>
> > Hi Micah,
> >
> > Thanks a lot for your valuable comments. Please see my comments inline.
> >
> > > I'm a little concerned that this will change assumptions for at least
> > some
> > > of the clients using the library (some might always rely on the
> validity
> > > buffer being present).
> >
> > I can understand your concern and I am also concerned.
> > IMO, the client should not depend on this assumption, as the
> specification
> > says "Arrays having a 0 null count may choose to not allocate the
> validity
> > bitmap." [1]
> > That being said, I think it would be safe to provide a global flag to
> > switch on/off the feature (as you suggested).
> >
> > > I think this is a good feature to have for the reasons you mentioned.
> It
> > > seems like there would need to be some sort of configuration bit to set
> > for
> > > this behavior.
> >
> > Good suggestion. We should be able to switch on and off the feature with
> a
> > single global flag.
> >
> > > But, I'd be worried about code complexity this would
> > > introduce.
> >
> > I agree with you that code complexity is an important factor to consider.
> > IMO, our proposal should not involve too much code change, or increase
> code
> > complexity too much.
> > To prove this, maybe we need to show some small experimental code change.
> >
> > Best,
> > Liya Fan
> >
> > [1] https://arrow.apache.org/docs/format/Columnar.html#logical-types
> >
> > On Wed, Mar 11, 2020 at 1:53 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > Hi Liya Fan,
> > > I'm a little concerned that this will change assumptions for at least
> > some
> > > of the clients using the library (some might always rely on the
> validity
> > > buffer being present).
> > >
> > > I think this is a good feature to have for the reasons you mentioned.
> It
> > > seems like there would need to be some sort of configuration bit to set
> > for
> > > this behavior. But, I'd be worried about code complexity this would
> > > introduce.
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Tue, Mar 10, 2020 at 6:42 AM Fan Liya <liya.fa...@gmail.com> wrote:
> > >
> > > > Hi Wes,
> > > >
> > > > Thanks a lot for your quick reply.
> > > > I think what you mentioned is almost exactly what we want to do in
> > > > Java. The
> > > > concept is not important.
> > > >
> > > > Maybe there are only some minor differences:
> > > > 1. In C++, the null_count is mutable, while for Java, once a vector
> is
> > > > constructed as non-nullable, its null count can only be 0.
> > > > 2. In C++, a non-nullable array's validity buffer is null, while in
> > Java,
> > > > the buffer is an empty buffer, and cannot be changed.
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > > >
> > > > > hi Liya,
> > > > >
> > > > > In C++ we elect certain faster code paths when the null count is 0
> or
> > > > > computed to be zero. When the null count is 0, we do not allocate a
> > > > > validity bitmap. And there is a "nullable" metadata-only flag at
> the
> > > > > Field level. Could the same kinds of optimizations be implemented
> in
> > > > > Java without introducing a "nullable" concept?
> > > > >
> > > > > - Wes
> > > > >
> > > > > On Tue, Mar 10, 2020 at 8:13 AM Fan Liya <liya.fa...@gmail.com>
> > wrote:
> > > > > >
> > > > > > Dear all,
> > > > > >
> > > > > > A non-nullable vector is one that is guaranteed to contain no
> > nulls.
> > > We
> > > > > > want to support non-nullable vectors in Java.
> > > > > >
> > > > > > *Motivations:*
> > > > > > 1. It is widely used in practice. For example, in a database
> > engine,
> > > a
> > > > > > column can be declared as not null, so it cannot contain null
> > values.
> > > > > > > 2. Non-nullable vectors have significant performance advantages
> > > > compared
> > > > > with
> > > > > > > their nullable counterparts, such as:
> > > > > >   1) the memory space of the validity buffer can be saved.
> > > > > >   2) manipulation of the validity buffer can be bypassed
> > > > > >   3) some if-else branches can be replaced by sequential
> > instructions
> > > > (by
> > > > > > the JIT compiler), leading to high throughput for the CPU
> pipeline.
> > > > > >
> > > > > > *Potential Cost:*
> > > > > > For nullable vectors, there can be extra checks against the
> > > > > nullability
> > > > > > flag. So we must change the code in a way that minimizes the
> cost.
> > > > > >
> > > > > > *Proposed Changes:*
> > > > > > 1. There is no need to create new vector classes. We add a final
> > > > boolean
> > > > > to
> > > > > > the vector base classes as the nullability flag. The value of the
> > > flag
> > > > > can
> > > > > > be obtained from the field when creating the vector.
> > > > > > 2. Add a method "boolean isNullable()" to the root interface
> > > > ValueVector.
> > > > > > 3. If a vector is non-nullable, its validity buffer should be an
> > > empty
> > > > > > buffer (not null, so much of the existing logic can be left
> > > unchanged).
> > > > > > 4. For operations involving validity buffers (e.g. isNull, get,
> > set),
> > > > we
> > > > > > use the nullability flag to bypass manipulations to the validity
> > > > buffer.
> > > > > >
> > > > > > Therefore, it should be possible to support the feature with
> small
> > > code
> > > > > > changes.
> > > > > >
> > > > > > BTW, please note that similar behaviors have already been
> supported
> > > in
> > > > > C++.
> > > > > >
> > > > > > > Would you please give your valuable feedback?
> > > > > >
> > > > > > Best,
> > > > > > Liya Fan
> > > > >
> > > >
> > >
> >
>
