Hi Wes,

Thanks for the effort. I will add clarifications.

Best,
Liya Fan

On Wed, Sep 4, 2019 at 11:06 AM Wes McKinney <wesmck...@gmail.com> wrote:

> I opened https://issues.apache.org/jira/browse/ARROW-6451
>
> On Sun, Sep 1, 2019 at 9:59 PM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > Hi Wes,
> >
> > Thanks for the information.
> > I agree with you that we had better make this clear in the document, to
> > help users avoid unexpected behaviors.
> >
> > Best,
> > Liya Fan
> >
> > On Sun, Sep 1, 2019 at 7:17 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > Option 3 is what the columnar specification currently intends, for
> > > the reasons that Jacques cites. In particular, a value can be made
> > > null only by altering the validity bitmap. It might be helpful to add
> > > some language to make clear that the contents "underneath" a null can
> > > be anything. The same is true of other memory layouts also, including
> > > primitive.
> > >
> > > On Thu, Aug 29, 2019 at 12:50 AM Fan Liya <liya.fa...@gmail.com> wrote:
> > > >
> > > > Hi Jacques and Ravindra,
> > > >
> > > > Thanks for your valuable feedback.
> > > >
> > > > Please let me talk more about contiguous memory:
> > > > For some operations (like memory segment comparison, hash code
> > > > computation, etc.), if we choose option 1 or 2, we can get the result
> > > > with a single call, without any reference to the validity buffer.
> > > >
> > > > With option 3, we need to split the memory into contiguous regions
> > > > separated by undefined regions (based on the validity buffer), then
> > > > calculate the result for each region and finally combine them. This is
> > > > less efficient.
> > > >
> > > > Ravindra's idea sounds interesting, especially when most values are null
> > > > or non-null.
> > > >
> > > > What do you think?
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > On Thu, Aug 29, 2019 at 1:26 PM Ravindra Pindikura <ravin...@dremio.com> wrote:
> > > >
> > > > > On Wed, Aug 28, 2019 at 12:32 PM Fan Liya <liya.fa...@gmail.com> wrote:
> > > > >
> > > > > > Dear all,
> > > > > >
> > > > > > > In the discussion of this PR
> > > > > > > (https://github.com/apache/arrow/pull/5073), we are faced with a
> > > > > > > problem:
> > > > > >
> > > > > > > Normally, in a VariableWidthVector (e.g. VarCharVector), a null value
> > > > > > > is supposed to take no space in the data buffer. In particular, for a
> > > > > > > null value, we have
> > > > > > >
> > > > > > > start index == end index
> > > > > > >
> > > > > > > Where start index and end index are the start/end positions of the
> > > > > > > value in the data buffer. This problem is also related to the
> > > > > > > ListVector.
> > > > > >
> > > > > > > However, it seems that for some scenarios, a null value can take
> > > > > > > non-empty space (please see this comment
> > > > > > > https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491).
> > > > > >
> > > > > > > Since this is an important issue, we should make it clear in the
> > > > > > > specification. Otherwise, some unexpected problems may occur in
> > > > > > > client code.
> > > > > >
> > > > > > It seems we are faced with 3 options:
> > > > > >
> > > > > > 1. a null value always takes no space.
> > > > > > > 2. a null value can take non-empty space, and the content of the
> > > > > > > non-empty space is always 0.
> > > > > > > 3. a null value can take non-empty space, and the content of the
> > > > > > > non-empty space is undefined.
> > > > > >
> > > > > > > Option 1 makes the data buffer of a VariableWidthVector a contiguous
> > > > > > > region (not interleaved with undefined regions), so optimizations can
> > > > > > > be applied. However, it may lead to memory copy/move (as indicated in
> > > > > > > the above comment
> > > > > > > https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491).
> > > > > >
> > > > > > > Option 3 can address the above problem of memory copy/move. However,
> > > > > > > it splits memory into non-contiguous regions, so optimizations cannot
> > > > > > > be performed. In addition, it may cause unexpected problems in client
> > > > > > > code.
> > > > > >
> > > > >
> > > > > > We could still apply the optimisation for the contiguous "valid
> > > > > > regions", e.g. if the entire vector (called an array in cpp) is valid,
> > > > > > then compare the data buffers. If there are only two null entries in
> > > > > > the vector, compare the three consecutive regions in the data buffer,
> > > > > > and so on.
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > Option 2 seems like a trade-off between the two. However, it is
> not
> > > > > > suitable for ListVector.
> > > > > >
> > > > > > Please give your valuable feedback.
> > > > > >
> > > > > > Best,
> > > > > > Liya Fan
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks and regards,
> > > > > Ravindra.
> > > > >
> > >
>
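
To make the trade-off in this thread concrete, here is a minimal sketch in plain Python. It does not use the Arrow API: the variable-width layout is modelled with a list of offsets, a list of validity flags, and a bytes object, and the helper names (valid_regions, logically_equal) are hypothetical, chosen only for illustration. It shows how the data buffer splits into contiguous valid regions when a null may occupy space (option 3), and why a comparison then has to consult the validity buffer.

```python
from typing import List, Tuple


def valid_regions(offsets: List[int], validity: List[bool]) -> List[Tuple[int, int]]:
    """Merge the byte ranges of adjacent non-null values into contiguous regions.

    Hypothetical helper: offsets has len(validity) + 1 entries, and value i
    occupies data[offsets[i]:offsets[i + 1]].
    """
    regions: List[Tuple[int, int]] = []
    for i, is_valid in enumerate(validity):
        start, end = offsets[i], offsets[i + 1]
        if not is_valid or start == end:
            continue  # null slot (its bytes, if any, are ignored) or empty value
        if regions and regions[-1][1] == start:
            regions[-1] = (regions[-1][0], end)  # extend the previous region
        else:
            regions.append((start, end))
    return regions


def logically_equal(offsets_a, data_a, offsets_b, data_b, validity_a, validity_b) -> bool:
    """Compare two vectors value by value, ignoring bytes underneath nulls."""
    if validity_a != validity_b:
        return False
    for i, is_valid in enumerate(validity_a):
        if not is_valid:
            continue
        a = data_a[offsets_a[i]:offsets_a[i + 1]]
        b = data_b[offsets_b[i]:offsets_b[i + 1]]
        if a != b:
            return False
    return True


# Both vectors hold the logical values ["ab", null, "cd"].
validity = [True, False, True]

# Option 1 style: the null takes no space (start index == end index).
x_offsets, x_data = [0, 2, 2, 4], b"abcd"

# Option 3 style: the null spans three bytes of arbitrary content.
y_offsets, y_data = [0, 2, 5, 7], b"abXYZcd"

x_regions = valid_regions(x_offsets, validity)
y_regions = valid_regions(y_offsets, validity)
same = logically_equal(x_offsets, x_data, y_offsets, y_data, validity, validity)
```

In this sketch the first vector yields a single region covering the whole data buffer, so one bulk call would suffice, while the second yields two regions separated by the bytes under the null. The raw buffers differ even though the vectors are logically equal, which is why the validity buffer cannot be skipped under option 3.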
