Dear all,

In the discussion of this PR (https://github.com/apache/arrow/pull/5073),
we are faced with a problem:

Normally, in a VariableWidthVector (e.g. VarCharVector), a null value is
supposed to take no space in the data buffer. In particular, for a null
value, we have

start index == end index

where start index and end index are the start and end positions of the value
in the data buffer. The same question also applies to ListVector.
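
To make the condition above concrete, here is a minimal sketch in plain Java.
It does not use the Arrow API: the offset buffer is modeled as an int[] with
valueCount + 1 entries and the validity buffer as a boolean[], and the class
and method names (NullSlotCheck, nullsAreEmpty) are hypothetical.

```java
final class NullSlotCheck {
    /**
     * Returns true if every null slot has start index == end index.
     * Assumption: value i occupies [offsets[i], offsets[i + 1]) in the data buffer.
     */
    static boolean nullsAreEmpty(int[] offsets, boolean[] isValid) {
        for (int i = 0; i < isValid.length; i++) {
            if (!isValid[i] && offsets[i] != offsets[i + 1]) {
                return false; // a null value takes non-empty space
            }
        }
        return true;
    }

    public static void main(String[] args) {
        boolean[] isValid = {true, false, true};
        // Null slot 1 has start == end == 3.
        System.out.println(nullsAreEmpty(new int[] {0, 3, 3, 8}, isValid));
        // Null slot 1 spans [3, 5), so it takes two bytes.
        System.out.println(nullsAreEmpty(new int[] {0, 3, 5, 8}, isValid));
    }
}
```

The first call describes a vector in the "normal" state; the second describes
the scenario from the review comment, where a null value takes space.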

However, it seems that for some scenarios, a null value can take non-empty
space (please see this comment
https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491).

Since this is an important issue, we should make it clear in the
specification. Otherwise, some unexpected problems may occur in client code.

It seems we are faced with 3 options:

1. a null value always takes no space.
2. a null value can take non-empty space, and the content of the non-empty
space is always 0.
3. a null value can take non-empty space, and the content of the non-empty
space is undefined.

Option 1 makes the data buffer of a VariableWidthVector a contiguous region
(not interleaved with undefined regions), so optimizations can be applied.
However, it may require memory copies/moves (as indicated in the above comment
https://github.com/apache/arrow/pull/5073#pullrequestreview-274215491).
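
As a rough illustration of the copy/move cost, the sketch below compacts a
vector whose null values take space into one that satisfies Option 1. It is
plain Java rather than the Arrow API, and the names (CompactNulls, Compacted,
compact) are hypothetical. This is one possible way to enforce Option 1, not
necessarily what the PR does.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CompactNulls {
    /** Holder for the rewritten offset and data buffers. */
    static final class Compacted {
        final int[] offsets;
        final byte[] data;

        Compacted(int[] offsets, byte[] data) {
            this.offsets = offsets;
            this.data = data;
        }
    }

    /** Copies only the bytes of valid values, so every null slot ends up empty. */
    static Compacted compact(int[] offsets, boolean[] isValid, byte[] data) {
        int n = isValid.length;
        int[] newOffsets = new int[n + 1];
        byte[] newData = new byte[offsets[n] - offsets[0]]; // upper bound on size
        int pos = 0;
        for (int i = 0; i < n; i++) {
            newOffsets[i] = pos;
            if (isValid[i]) {
                int len = offsets[i + 1] - offsets[i];
                System.arraycopy(data, offsets[i], newData, pos, len); // the copy in question
                pos += len;
            }
        }
        newOffsets[n] = pos;
        return new Compacted(newOffsets, Arrays.copyOf(newData, pos));
    }

    public static void main(String[] args) {
        byte[] data = "abXXXcd".getBytes(StandardCharsets.US_ASCII);
        Compacted c = compact(new int[] {0, 2, 5, 7}, new boolean[] {true, false, true}, data);
        System.out.println(Arrays.toString(c.offsets));
        System.out.println(new String(c.data, StandardCharsets.US_ASCII));
    }
}
```

Every valid value after the first non-empty null has to be moved, which is
the cost Option 1 would impose on producers that cannot avoid such slots.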

Option 3 avoids the above problem of memory copy/move. However, it splits
the data buffer into non-contiguous regions, so such optimizations cannot be
performed. In addition, it may cause unexpected problems in client code.
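
One kind of client problem I have in mind (this example is my own assumption,
not something taken from the PR) is code that compares or hashes the raw data
buffer. In the plain-Java sketch below, with hypothetical names, two vectors
hold the same logical values but differ in the undefined bytes of a null slot.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class UndefinedNullBytes {
    /** Slot-wise comparison that ignores whatever bytes sit inside null slots. */
    static boolean logicallyEqual(int[] offA, boolean[] validA, byte[] dataA,
                                  int[] offB, boolean[] validB, byte[] dataB) {
        if (validA.length != validB.length) {
            return false;
        }
        for (int i = 0; i < validA.length; i++) {
            if (validA[i] != validB[i]) {
                return false;
            }
            if (!validA[i]) {
                continue; // both null: contents are irrelevant
            }
            int lenA = offA[i + 1] - offA[i];
            int lenB = offB[i + 1] - offB[i];
            if (lenA != lenB) {
                return false;
            }
            for (int j = 0; j < lenA; j++) {
                if (dataA[offA[i] + j] != dataB[offB[i] + j]) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] offsets = {0, 2, 5, 7};
        boolean[] isValid = {true, false, true};
        byte[] a = "ab\0\0\0cd".getBytes(StandardCharsets.US_ASCII);
        byte[] b = "abXYZcd".getBytes(StandardCharsets.US_ASCII);
        // Raw buffer comparison disagrees with the logical comparison.
        System.out.println(Arrays.equals(a, b));
        System.out.println(logicallyEqual(offsets, isValid, a, offsets, isValid, b));
    }
}
```

A client that takes the shortcut of comparing buffers directly would treat
these two vectors as different under Option 3, while under Option 1 the
shortcut stays valid for vectors with identical offsets.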

Option 2 seems like a trade-off between the two. However, it is not
suitable for ListVector.

Please give your valuable feedback.

Best,
Liya Fan
