Hi Liya Fan,
I'm a little concerned that this will change assumptions for at least some
of the clients using the library (some might rely on the validity buffer
always being present).

I think this is a good feature to have for the reasons you mentioned. It
seems like there would need to be some sort of configuration bit to set for
this behavior. But I'd be worried about the code complexity this would
introduce.
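
To make the discussion concrete, here is a rough sketch of how I read the
proposal. All names are hypothetical: SketchIntVector is not an Arrow class,
and a plain byte[] stands in for the validity buffer. It only illustrates the
final nullability flag, the empty validity buffer, and the bypassed checks.

```java
/** Hypothetical stand-in for an int vector; NOT an Arrow class. */
final class SketchIntVector {
    private final boolean nullable;  // fixed at construction (proposed change 1)
    private final int[] values;
    private final byte[] validity;   // one bit per slot; zero-length if non-nullable

    SketchIntVector(int capacity, boolean nullable) {
        this.nullable = nullable;
        this.values = new int[capacity];
        // Proposed change 3: an empty (but non-null) buffer for non-nullable vectors.
        this.validity = nullable ? new byte[(capacity + 7) / 8] : new byte[0];
    }

    // Proposed change 2: expose the flag.
    boolean isNullable() {
        return nullable;
    }

    // Proposed change 4: skip the validity buffer when the flag says "no nulls".
    boolean isNull(int index) {
        if (!nullable) {
            return false;
        }
        return (validity[index >> 3] & (1 << (index & 7))) == 0;
    }

    void set(int index, int value) {
        values[index] = value;
        if (nullable) {
            validity[index >> 3] |= 1 << (index & 7);  // mark the slot as valid
        }
    }

    void setNull(int index) {
        if (!nullable) {
            throw new UnsupportedOperationException("vector is non-nullable");
        }
        validity[index >> 3] &= ~(1 << (index & 7));   // clear the validity bit
    }

    int get(int index) {
        if (isNull(index)) {
            throw new IllegalStateException("value at index " + index + " is null");
        }
        return values[index];
    }

    int getNullCount() {
        if (!nullable) {
            return 0;  // can only ever be 0 for a non-nullable vector
        }
        int nulls = 0;
        for (int i = 0; i < values.length; i++) {
            if (isNull(i)) {
                nulls++;
            }
        }
        return nulls;
    }

    // What a client reading the validity buffer directly would observe.
    int validityBufferSize() {
        return validity.length;
    }
}
```

In this sketch, a client that reads the validity buffer directly sees a
zero-length buffer in the non-nullable case, which is the kind of assumption
change I mean above.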

Thanks,
Micah

On Tue, Mar 10, 2020 at 6:42 AM Fan Liya <liya.fa...@gmail.com> wrote:

> Hi Wes,
>
> Thanks a lot for your quick reply.
> I think what you mentioned is almost exactly what we want to do in Java. The
> concept is not important.
>
> Maybe there are only some minor differences:
> 1. In C++, the null_count is mutable, while for Java, once a vector is
> constructed as non-nullable, its null count can only be 0.
> 2. In C++, a non-nullable array's validity buffer is null, while in Java,
> the buffer is an empty buffer, and cannot be changed.
>
> Best,
> Liya Fan
>
> On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi Liya,
> >
> > In C++ we elect certain faster code paths when the null count is 0 or
> > computed to be zero. When the null count is 0, we do not allocate a
> > validity bitmap. And there is a "nullable" metadata-only flag at the
> > Field level. Could the same kinds of optimizations be implemented in
> > Java without introducing a "nullable" concept?
> >
> > - Wes
> >
> > On Tue, Mar 10, 2020 at 8:13 AM Fan Liya <liya.fa...@gmail.com> wrote:
> > >
> > > Dear all,
> > >
> > > A non-nullable vector is one that is guaranteed to contain no nulls. We
> > > want to support non-nullable vectors in Java.
> > >
> > > *Motivations:*
> > > 1. It is widely used in practice. For example, in a database engine, a
> > > column can be declared as not null, so it cannot contain null values.
> > > > 2. Non-nullable vectors have significant performance advantages compared
> > > > with their nullable counterparts, such as:
> > > >   1) the memory space of the validity buffer can be saved.
> > > >   2) manipulation of the validity buffer can be bypassed.
> > > >   3) some if-else branches can be replaced by sequential instructions
> > > > (by the JIT compiler), leading to higher throughput for the CPU pipeline.
> > >
> > > *Potential Cost:*
> > > > For nullable vectors, there can be extra checks against the nullability
> > > > flag. So we must change the code in a way that minimizes the cost.
> > >
> > > *Proposed Changes:*
> > > > 1. There is no need to create new vector classes. We add a final boolean
> > > > to the vector base classes as the nullability flag. The value of the flag
> > > > can be obtained from the field when creating the vector.
> > > > 2. Add a method "boolean isNullable()" to the root interface ValueVector.
> > > 3. If a vector is non-nullable, its validity buffer should be an empty
> > > buffer (not null, so much of the existing logic can be left unchanged).
> > > > 4. For operations involving validity buffers (e.g. isNull, get, set), we
> > > > use the nullability flag to bypass manipulations to the validity buffer.
> > >
> > > Therefore, it should be possible to support the feature with small code
> > > changes.
> > >
> > > > BTW, please note that similar behavior is already supported in C++.
> > >
> > > > Would you please give your valuable feedback?
> > >
> > > Best,
> > > Liya Fan
> >
>
