Hi Wes,

Thanks for the effort. I will add clarifications.

Best,
Liya Fan

On Wed, Sep 4, 2019 at 11:06 AM Wes McKinney <wesmck...@gmail.com> wrote:
> I opened https://issues.apache.org/jira/browse/ARROW-6451
>
> On Sun, Sep 1, 2019 at 9:59 PM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > Hi Wes,
> >
> > Thanks for the information.
> > I agree with you that we had better make this clear in the document, to
> > help users avoid unexpected behaviors.
> >
> > Best,
> > Liya Fan
> >
> > On Sun, Sep 1, 2019 at 7:17 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > Option 3 is what the columnar specification currently intends, for
> > > the reasons that Jacques cites. In particular, a value can be made
> > > null only by altering the validity bitmap. It might be helpful to add
> > > some language to make clear that the contents "underneath" a null can
> > > be anything. The same is true of other memory layouts also, including
> > > primitive.
> > >
> > > On Thu, Aug 29, 2019 at 12:50 AM Fan Liya <liya.fa...@gmail.com> wrote:
> > > >
> > > > Hi Jacques and Ravindra,
> > > >
> > > > Thanks for your valuable feedback.
> > > >
> > > > Please let me talk more about contiguous memory:
> > > > For some operations (like memory segment comparison, hash code
> > > > computation, etc.), if we choose option 1 or 2, we can get the result
> > > > with a single call, without any reference to the validity buffer.
> > > >
> > > > With option 3, we need to split the memory into contiguous regions
> > > > separated by undefined regions (based on the validity buffer), and then
> > > > we calculate the result for each region and finally combine them. This
> > > > is less efficient.
> > > >
> > > > Ravindra's idea sounds interesting, especially when most values are
> > > > null or most are non-null.
> > > >
> > > > What do you think?
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > On Thu, Aug 29, 2019 at 1:26 PM Ravindra Pindikura <ravin...@dremio.com>
> > > > wrote:
> > > > >
> > > > > On Wed, Aug 28, 2019 at 12:32 PM Fan Liya <liya.fa...@gmail.com> wrote:
> > > > > >
> > > > > > Dear all,
> > > > > >
> > > > > > In the discussion of this PR (https://github.com/apache/arrow/pull/5073),
> > > > > > we are faced with a problem:
> > > > > >
> > > > > > Normally, in a VariableWidthVector (e.g. VarCharVector), a null value is
> > > > > > supposed to take no space in the data buffer. In particular, for a null
> > > > > > value, we have
> > > > > >
> > > > > > start index == end index
> > > > > >
> > > > > > where start index and end index are the start/end positions of the value
> > > > > > in the data buffer. This problem also applies to ListVector.
> > > > > >
> > > > > > However, it seems that in some scenarios, a null value can take non-empty
> > > > > > space (please see this comment:
> > > > > > https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491).
> > > > > >
> > > > > > Since this is an important issue, we should make it clear in the
> > > > > > specification. Otherwise, some unexpected problems may occur in client
> > > > > > code.
> > > > > >
> > > > > > It seems we are faced with 3 options:
> > > > > >
> > > > > > 1. a null value always takes no space.
> > > > > > 2. a null value can take non-empty space, and the content of the
> > > > > > non-empty space is always 0.
> > > > > > 3. a null value can take non-empty space, and the content of the
> > > > > > non-empty space is undefined.
> > > > > >
> > > > > > Option 1 makes the data buffer of a VariableWidthVector a contiguous
> > > > > > region (not interleaved with undefined regions), so optimizations can be
> > > > > > applied.
> > > > > > However, it may lead to memory copy/move (as indicated in the comment
> > > > > > above:
> > > > > > https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491).
> > > > > >
> > > > > > Option 3 can address the above problem of memory copy/move. However, it
> > > > > > splits memory into non-contiguous regions, so optimizations cannot be
> > > > > > performed. In addition, it may cause unexpected problems in client code.
> > > > > >
> > > > >
> > > > > We could still apply the optimisation to the contiguous "valid regions",
> > > > > e.g. if the entire vector (called an array in C++) is valid, then compare
> > > > > the data buffers. If there are only two null entries in the vector,
> > > > > compare the three consecutive regions in the data buffer, and so on.
> > > > >
> > > > > >
> > > > > > Option 2 seems like a trade-off between the two. However, it is not
> > > > > > suitable for ListVector.
> > > > > >
> > > > > > Please give your valuable feedback.
> > > > > >
> > > > > > Best,
> > > > > > Liya Fan
> > > > > >
> > > > >
> > > > > --
> > > > > Thanks and regards,
> > > > > Ravindra.
> > > > >
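
For reference, here is a minimal sketch of the region-wise comparison discussed in this thread, assuming option 3 (the bytes "underneath" a null are undefined). It is plain Python over lists and bytes, not the Arrow Java API: VarVector, valid_runs and vectors_equal are hypothetical names made up for illustration. Each maximal run of non-null values is compared with one slice comparison of the data buffer, plus a check that the value boundaries inside the run line up.

```python
from typing import Iterator, List, NamedTuple, Tuple


class VarVector(NamedTuple):
    """Toy model of a variable-width vector (hypothetical, not an Arrow class)."""
    validity: List[bool]  # True means the slot is non-null
    offsets: List[int]    # len(validity) + 1 entries, non-decreasing
    data: bytes           # bytes under a null slot may be anything (option 3)


def valid_runs(validity: List[bool]) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs for maximal runs of non-null slots."""
    start = None
    for i, is_valid in enumerate(validity):
        if is_valid and start is None:
            start = i
        elif not is_valid and start is not None:
            yield start, i
            start = None
    if start is not None:
        yield start, len(validity)


def vectors_equal(a: VarVector, b: VarVector) -> bool:
    """Compare logical contents, ignoring whatever bytes sit under nulls."""
    if a.validity != b.validity:
        return False
    for start, end in valid_runs(a.validity):
        a_lo, b_lo = a.offsets[start], b.offsets[start]
        # Value boundaries must match relative to the start of the run,
        # otherwise ["ab", "c"] and ["a", "bc"] would look identical.
        for i in range(start + 1, end + 1):
            if a.offsets[i] - a_lo != b.offsets[i] - b_lo:
                return False
        # One contiguous comparison covers every value in the run.
        if a.data[a_lo:a.offsets[end]] != b.data[b_lo:b.offsets[end]]:
            return False
    return True


# ["ab", null, "cd"]: the null takes no space in one vector and three
# arbitrary bytes in the other, yet the vectors are logically equal.
compact = VarVector([True, False, True], [0, 2, 2, 4], b"abcd")
padded = VarVector([True, False, True], [0, 2, 5, 7], b"abXYZcd")
print(vectors_equal(compact, padded))
```

With no nulls this degenerates to a single comparison of the whole data buffer; with two nulls it makes at most three slice comparisons, which matches the "three consecutive regions" case mentioned above.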