I suppose one common use case is materializing list columns after some
expanding operation like a join or unnest. That's a case where I could
imagine a lot of repetition of values. Haven't yet thought of common cases
where there is overlap but not full duplication, but am eager to hear any.

The dictionary encoding point Raphael makes is interesting, especially
given the existence of LargeList and FixedSizeList. For many operations, it
might make more sense to just compose those existing types.

IIUC the operations that would be unique to the ArrayView are ones altering
the shape. One could truncate each array to a certain length cheaply simply
by replacing the sizes buffer. Or perhaps there are interesting operations
on tensors that would benefit.

On Tue, Apr 25, 2023 at 7:47 PM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

> Unless I am missing something, I think the selection use-case could be
> equally well served by a dictionary-encoded BinarArray/ListArray, and would
> have the benefit of not requiring any modifications to the existing format
> or kernels.
>
> The major additional flexibility of the proposed encoding would be
> permitting disjoint or overlapping ranges, are these common enough in
> practice to represent a meaningful bottleneck?
>
>
> On 26 April 2023 01:40:14 BST, David Li <lidav...@apache.org> wrote:
> >Is there a need for a 64-bit offsets version the same way we have List
> and LargeList?
> >
> >And just to be clear, the difference with List is that the lists don't
> have to be stored in their logical order (or in other words, offsets do not
> have to be nondecreasing and so we also need sizes)?
> >
> >On Wed, Apr 26, 2023, at 09:37, Weston Pace wrote:
> >> For context, there was some discussion on this back in [1].  At that
> time
> >> this was called "sequence view" but I do not like that name.  However,
> >> array-view array is a little confusing.  Given this is similar to list
> can
> >> we go with list-view array?
> >>
> >>> Thanks for the introduction. I'd be interested to hear about the
> >>> applications Velox has found for these vectors, and in what situations
> >> they
> >>> are useful. This could be contrasted with the current ListArray
> >>> implementations.
> >>
> >> I believe one significant benefit is that take (and by proxy, filter)
> and
> >> sort are O(# of items) with the proposed format and O(# of bytes) with
> the
> >> current format.  Jorge did some profiling to this effect in [1].
> >>
> >> [1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> >>
> >> On Tue, Apr 25, 2023 at 3:13 PM Will Jones <will.jones...@gmail.com>
> wrote:
> >>
> >>> Hi Felipe,
> >>>
> >>> Thanks for the introduction. I'd be interested to hear about the
> >>> applications Velox has found for these vectors, and in what situations
> they
> >>> are useful. This could be contrasted with the current ListArray
> >>> implementations.
> >>>
> >>> IIUC it would be fairly cheap to transform a ListArray to an
> ArrayView, but
> >>> expensive to go the other way.
> >>>
> >>> Best,
> >>>
> >>> Will Jones
> >>>
> >>> On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira Carvalho <
> >>> felipe...@gmail.com> wrote:
> >>>
> >>> > Hi folks,
> >>> >
> >>> > I would like to start a public discussion on the inclusion of a new
> array
> >>> > format to Arrow — array-view array. The name is also up for debate.
> >>> >
> >>> > This format is inspired by Velox's ArrayVector format [1]. Logically,
> >>> this
> >>> > array represents an array of arrays. Each element is an array-view
> >>> (offset
> >>> > and size pair) that points to a range within a nested "values" array
> >>> > (called "elements" in Velox docs). The nested array can be of any
> type,
> >>> > which makes this format very flexible and powerful.
> >>> >
> >>> > [image: ../_images/array-vector.png]
> >>> > <https://facebookincubator.github.io/velox/_images/array-vector.png>
> >>> >
> >>> > I'm currently working on a C++ implementation and plan to work on a
> Go
> >>> > implementation to fulfill the two-implementations requirement for
> format
> >>> > changes.
> >>> >
> >>> > The draft design:
> >>> >
> >>> > - 3 buffers: [validity_bitmap, int32 offsets buffer, int32 sizes
> buffer]
> >>> > - 1 child array: "values" as an array of the type parameter
> >>> >
> >>> > validity_bitmap is used to differentiate between empty array views
> >>> > (sizes[i] == 0) and NULL array views (validity_bitmap[i] == 0).
> >>> >
> >>> > When the validity_bitmap[i] is 0, both sizes and offsets are
> undefined
> >>> (as
> >>> > usual), and when sizes[i] == 0, offsets[i] is undefined. 0 is
> recommended
> >>> > if setting a value is not an issue to the system producing the
> arrays.
> >>> >
> >>> > offsets buffer is not required to be ordered and views don't have to
> be
> >>> > disjoint.
> >>> >
> >>> > [1]
> >>> >
> >>>
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> >>> >
> >>> > Thanks,
> >>> > Felipe O. Carvalho
> >>> >
> >>>
>

Reply via email to