Hi Wes,

Thanks for the information.
I agree that we should make this clear in the documentation, to
help users avoid unexpected behavior.

Best,
Liya Fan

On Sun, Sep 1, 2019 at 7:17 AM Wes McKinney <wesmck...@gmail.com> wrote:

> Option 3 is what the columnar specification currently intends, for
> the reasons that Jacques cites. In particular, a value can be made
> null only by altering the validity bitmap. It might be helpful to add
> some language to make clear that the contents "underneath" a null can
> be anything. The same is true of other memory layouts also, including
> primitive.
>
> On Thu, Aug 29, 2019 at 12:50 AM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > Hi Jacques and Ravindra,
> >
> > Thanks for your valuable feedback.
> >
> > Please let me talk more about contiguous memory:
> > For some operations (like memory segment comparison, hash code computation,
> > etc.), if we choose option 1 or 2, we can get the result with a single
> > call, without any reference to the validity buffer.
> >
> > With option 3, we need to split the memory into contiguous regions
> > separated by undefined regions (based on the validity buffer), compute
> > the result for each region, and finally combine them. This is less
> > efficient.
> >
> > Ravindra's idea sounds interesting, especially when most values are null or
> > non-null.
> >
> > What do you think?
> >
> > Best,
> > Liya Fan
> >
> > On Thu, Aug 29, 2019 at 1:26 PM Ravindra Pindikura <ravin...@dremio.com>
> > wrote:
> >
> > > On Wed, Aug 28, 2019 at 12:32 PM Fan Liya <liya.fa...@gmail.com> wrote:
> > >
> > > > Dear all,
> > > >
> > > > In the discussion of this PR (https://github.com/apache/arrow/pull/5073),
> > > > we are faced with a problem:
> > > >
> > > > Normally, in a VariableWidthVector (e.g. VarCharVector), a null value is
> > > > supposed to take no space in the data buffer. In particular, for a null
> > > > value, we have
> > > >
> > > > start index == end index
> > > >
> > > > where start index and end index are the start/end positions of the value
> > > > in the data buffer. This problem is also related to the ListVector.
> > > >
> > > > However, it seems that for some scenarios, a null value can take non-empty
> > > > space (please see this comment:
> > > > https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491).
> > > >
> > > > Since this is an important issue, we should make it clear in the
> > > > specification. Otherwise, some unexpected problems may occur in client
> > > > code.
> > > >
> > > > It seems we are faced with 3 options:
> > > >
> > > > 1. a null value always takes no space.
> > > > 2. a null value can take non-empty space, and the content of the non-empty
> > > > space is always 0.
> > > > 3. a null value can take non-empty space, and the content of the non-empty
> > > > space is undefined.
> > > >
> > > > Option 1 makes the data buffer of a VariableWidthVector a contiguous region
> > > > (not interleaved with undefined regions), so optimizations can be applied.
> > > > However, it may lead to memory copy/move (as indicated in the above comment:
> > > > https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491).
> > > >
> > > > Option 3 can address the above problem of memory copy/move. However, it
> > > > splits memory into non-contiguous regions, so optimizations cannot be
> > > > performed. In addition, it may cause unexpected problems in client code.
> > > >
> > >
> > > We could still apply the optimisation for the contiguous "valid regions",
> > > e.g. if the entire vector is valid (called an array in C++), then compare
> > > the data buffers. If there are only two null entries in the vector, compare
> > > the three consecutive regions in the data buffer, and so on.
> > >
> > >
> > >
> > > >
> > > > Option 2 seems like a trade-off between the two. However, it is not
> > > > suitable for ListVector.
> > > >
> > > > Please give your valuable feedback.
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > >
> > >
> > > --
> > > Thanks and regards,
> > > Ravindra.
> > >
>
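
For illustration, the region-wise comparison discussed in this thread might
look roughly like the sketch below. It is a toy model in plain Python, not the
Arrow Java or C++ API: the VarWidthVector class and the valid_runs and
logical_equals helpers are hypothetical names made up for this example. It
assumes option 3, i.e. the bytes "underneath" a null may be anything, so only
runs of consecutive valid slots are compared.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class VarWidthVector:
    """Toy stand-in for a variable-width vector (hypothetical, not the Arrow API)."""
    validity: List[bool]   # one flag per slot; False means null
    offsets: List[int]     # len(validity) + 1 positions into `data`
    data: bytes            # concatenated value bytes


def valid_runs(validity: List[bool]) -> Iterator[Tuple[int, int]]:
    """Yield half-open [start, end) slot ranges of consecutive valid slots."""
    start = None
    for i, is_valid in enumerate(validity):
        if is_valid and start is None:
            start = i
        elif not is_valid and start is not None:
            yield start, i
            start = None
    if start is not None:
        yield start, len(validity)


def logical_equals(a: VarWidthVector, b: VarWidthVector) -> bool:
    """Compare two vectors while ignoring whatever bytes sit under nulls."""
    if a.validity != b.validity:
        return False
    for start, end in valid_runs(a.validity):
        # Element lengths inside the run must agree...
        a_lens = [a.offsets[i + 1] - a.offsets[i] for i in range(start, end)]
        b_lens = [b.offsets[i + 1] - b.offsets[i] for i in range(start, end)]
        if a_lens != b_lens:
            return False
        # ...and then the whole run is compared as one contiguous slice.
        a_bytes = a.data[a.offsets[start]:a.offsets[end]]
        b_bytes = b.data[b.offsets[start]:b.offsets[end]]
        if a_bytes != b_bytes:
            return False
    return True


# "ab", null, "cd": the null takes zero bytes in `x` and two junk bytes in `y`.
x = VarWidthVector([True, False, True], [0, 2, 2, 4], b"abcd")
y = VarWidthVector([True, False, True], [0, 2, 4, 6], b"ab??cd")
```

In this sketch the two data buffers differ byte for byte, yet the vectors are
logically equal, which is why a single whole-buffer comparison is only safe
under option 1 or 2. When the vector has no nulls, valid_runs yields a single
run and the comparison degenerates to one slice comparison.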
