Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Ian Cook Fri, 19 May 2023 09:09:22 -0700

That's great, thanks Brent. If possible could you share a specific
example of the operation you are referring to so that we can better
reason about how the ListView layout would help in this case?


Any additional input from the community providing specifics of
real-world workloads that are expected to benefit from the ListView
layout would be much appreciated.
.
Ian

On Mon, May 15, 2023 at 5:12 PM Brent Gardner
<[email protected]> wrote:
>
> For what it's worth, my company is building a database using arrow(rs) as
> an in memory storage format, and this feature would be very helpful because
> it would allow us to bitmask out mvcc rows that have been deleted / have
> not yet been committed / have been rolled back, etc.
>
> - Brent
>
> On Mon, May 15, 2023, 06:55 Ian Cook <[email protected]> wrote:
>
> > I think it would be easier for us all to weigh the costs and benefits
> > of adding this proposed ListView layout to the Arrow specification and
> > implementing it in the various Arrow libraries if we could all see
> > some benchmarks demonstrating the performance/efficiency benefits
> > compared to Arrow’s existing List layouts.
> >
> > Based on the Velox paper [1] and from conversations with the Velox
> > developers, I would anticipate that these benchmarks will show that
> > ListView confers substantial performance/efficiency benefits on some
> > workloads. I suggest conferring with the Velox developers to identify
> > benchmark workloads will best demonstrate the performance/efficiency
> > benefit of the ListView layout while representing common real-world
> > workloads.
> >
> > Ian
> >
> > [1] https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
> >
> > On Sat, May 13, 2023 at 3:09 PM Andrew Lamb <[email protected]> wrote:
> > >
> > > I agree that it is hard to see  any compelling advantage of adopting
> > > ListView that would incentivize adding it to DataFusion.
> > >
> > > It also seems like the conversion requires changing only indexes (not the
> > > underlying data) so it would likely be relatively inexpensive I would
> > think
> > >
> > > On Thu, May 11, 2023 at 4:51 PM Raphael Taylor-Davies
> > > <[email protected]> wrote:
> > >
> > > > Hi All,
> > > >
> > > > > if we added this, do we think many Arrow and query
> > > > > engine implementations (for example, DataFusion) will be eager to add
> > > > full
> > > > > support for the type, including compute kernels? Or are they likely
> > to
> > > > just
> > > > > convert this type to ListArray at import boundaries?
> > > > I can't speak for query engines in general, but at least for arrow-rs
> > > > and by extension DataFusion, and based on my current understanding of
> > > > the use-cases I would be rather hesitant to add support to the kernels
> > > > for this array type, definitely instead favouring conversion at the
> > > > edges. We already have issues with the amount of code generation
> > > > resulting in binary bloat and long compile times, and I worry this
> > would
> > > > worsen this situation whilst not really providing compelling advantages
> > > > for the vast majority of workloads that don't interact with Velox.
> > > > Whilst I can definitely see that the ListView representation is
> > probably
> > > > a better way to represent variable length lists than what arrow settled
> > > > upon, I'm not yet convinced it is sufficiently better to incentivise
> > > > broad ecosystem adoption.
> > > >
> > > > Kind Regards,
> > > >
> > > > Raphael Taylor-Davies
> > > >
> > > > On 11/05/2023 21:20, Will Jones wrote:
> > > > > Hi Felipe,
> > > > >
> > > > > Thanks for the additional details.
> > > > >
> > > > >
> > > > >> Velox kernels benefit from being able to append data to the array
> > from
> > > > >> different threads without care for strict ordering. Only the offsets
> > > > array
> > > > >> has to be written according to logical order but that is
> > potentially a
> > > > much
> > > > >> smaller buffer than the values buffer.
> > > > >>
> > > > > It still seems to me like applications are still pretty niche, as I
> > > > suspect
> > > > > in most cases the benefits are outweighed by the costs. The benefit
> > here
> > > > > seems pretty limited: if you are trying to split work between
> > threads,
> > > > > usually you will have other levels such as array chunks to
> > parallelize.
> > > > And
> > > > > if you have an incoming stream of row data, you'll want to append in
> > > > > predictable order to match the order of the other arrays. Am I
> > missing
> > > > > something?
> > > > >
> > > > > And, IIUC, the cost of using ListView with out-of-order values over
> > > > > ListArray is you lose memory locality; the values of element 2 are no
> > > > > longer adjacent to the values of element 1. What do you think about
> > that
> > > > > tradeoff?
> > > > >
> > > > > I don't mean to be difficult about this. I'm excited for both the
> > REE and
> > > > > StringView arrays, but this one I'm not so sure about yet. I suppose
> > > > what I
> > > > > am trying to ask is, if we added this, do we think many Arrow and
> > query
> > > > > engine implementations (for example, DataFusion) will be eager to add
> > > > full
> > > > > support for the type, including compute kernels? Or are they likely
> > to
> > > > just
> > > > > convert this type to ListArray at import boundaries?
> > > > >
> > > > > Because if it turns out to be the latter, then we might as well ask
> > Velox
> > > > > to export this type as ListArray and save the rest of the ecosystem
> > some
> > > > > work.
> > > > >
> > > > > Best,
> > > > >
> > > > > Will Jones
> > > > >
> > > > > On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira Carvalho <
> > > > > [email protected]> wrote:
> > > > >
> > > > >> Initial reason for ListView arrays in Arrow is zero-copy
> > compatibility
> > > > with
> > > > >> Velox which uses this format.
> > > > >>
> > > > >> Velox kernels benefit from being able to append data to the array
> > from
> > > > >> different threads without care for strict ordering. Only the offsets
> > > > array
> > > > >> has to be written according to logical order but that is
> > potentially a
> > > > much
> > > > >> smaller buffer than the values buffer.
> > > > >>
> > > > >> Acero kernels could take advantage of that in the future.
> > > > >>
> > > > >> In implementing ListViewArray/Type I was able to reuse some C++
> > > > templates
> > > > >> used for ListArray which can reduce some of the burden on kernel
> > > > >> implementations that aim to work with all the types.
> > > > >>
> > > > >> I’m can fix Acero kernels for working with ListView. This is
> > similar to
> > > > the
> > > > >> work I’ve doing in kernels dealing with run-end encoded arrays.
> > > > >>
> > > > >> —
> > > > >> Felipe
> > > > >>
> > > > >>
> > > > >> On Wed, 26 Apr 2023 at 01:03 Will Jones <[email protected]>
> > > > wrote:
> > > > >>
> > > > >>> I suppose one common use case is materializing list columns after
> > some
> > > > >>> expanding operation like a join or unnest. That's a case where I
> > could
> > > > >>> imagine a lot of repetition of values. Haven't yet thought of
> > common
> > > > >> cases
> > > > >>> where there is overlap but not full duplication, but am eager to
> > hear
> > > > >> any.
> > > > >>> The dictionary encoding point Raphael makes is interesting,
> > especially
> > > > >>> given the existence of LargeList and FixedSizeList. For many
> > > > operations,
> > > > >> it
> > > > >>> might make more sense to just compose those existing types.
> > > > >>>
> > > > >>> IIUC the operations that would be unique to the ArrayView are ones
> > > > >> altering
> > > > >>> the shape. One could truncate each array to a certain length
> > cheaply
> > > > >> simply
> > > > >>> by replacing the sizes buffer. Or perhaps there are interesting
> > > > >> operations
> > > > >>> on tensors that would benefit.
> > > > >>>
> > > > >>> On Tue, Apr 25, 2023 at 7:47 PM Raphael Taylor-Davies
> > > > >>> <[email protected]> wrote:
> > > > >>>
> > > > >>>> Unless I am missing something, I think the selection use-case
> > could be
> > > > >>>> equally well served by a dictionary-encoded BinarArray/ListArray,
> > and
> > > > >>> would
> > > > >>>> have the benefit of not requiring any modifications to the
> > existing
> > > > >>> format
> > > > >>>> or kernels.
> > > > >>>>
> > > > >>>> The major additional flexibility of the proposed encoding would be
> > > > >>>> permitting disjoint or overlapping ranges, are these common
> > enough in
> > > > >>>> practice to represent a meaningful bottleneck?
> > > > >>>>
> > > > >>>>
> > > > >>>> On 26 April 2023 01:40:14 BST, David Li <[email protected]>
> > wrote:
> > > > >>>>> Is there a need for a 64-bit offsets version the same way we have
> > > > List
> > > > >>>> and LargeList?
> > > > >>>>> And just to be clear, the difference with List is that the lists
> > > > don't
> > > > >>>> have to be stored in their logical order (or in other words,
> > offsets
> > > > do
> > > > >>> not
> > > > >>>> have to be nondecreasing and so we also need sizes)?
> > > > >>>>> On Wed, Apr 26, 2023, at 09:37, Weston Pace wrote:
> > > > >>>>>> For context, there was some discussion on this back in [1].  At
> > that
> > > > >>>> time
> > > > >>>>>> this was called "sequence view" but I do not like that name.
> > > > >> However,
> > > > >>>>>> array-view array is a little confusing.  Given this is similar
> > to
> > > > >> list
> > > > >>>> can
> > > > >>>>>> we go with list-view array?
> > > > >>>>>>
> > > > >>>>>>> Thanks for the introduction. I'd be interested to hear about
> > the
> > > > >>>>>>> applications Velox has found for these vectors, and in what
> > > > >>> situations
> > > > >>>>>> they
> > > > >>>>>>> are useful. This could be contrasted with the current ListArray
> > > > >>>>>>> implementations.
> > > > >>>>>> I believe one significant benefit is that take (and by proxy,
> > > > >> filter)
> > > > >>>> and
> > > > >>>>>> sort are O(# of items) with the proposed format and O(# of
> > bytes)
> > > > >> with
> > > > >>>> the
> > > > >>>>>> current format.  Jorge did some profiling to this effect in [1].
> > > > >>>>>>
> > > > >>>>>> [1]
> > > > >> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > > > >>>>>> On Tue, Apr 25, 2023 at 3:13 PM Will Jones <
> > [email protected]
> > > > >>>> wrote:
> > > > >>>>>>> Hi Felipe,
> > > > >>>>>>>
> > > > >>>>>>> Thanks for the introduction. I'd be interested to hear about
> > the
> > > > >>>>>>> applications Velox has found for these vectors, and in what
> > > > >>> situations
> > > > >>>> they
> > > > >>>>>>> are useful. This could be contrasted with the current ListArray
> > > > >>>>>>> implementations.
> > > > >>>>>>>
> > > > >>>>>>> IIUC it would be fairly cheap to transform a ListArray to an
> > > > >>>> ArrayView, but
> > > > >>>>>>> expensive to go the other way.
> > > > >>>>>>>
> > > > >>>>>>> Best,
> > > > >>>>>>>
> > > > >>>>>>> Will Jones
> > > > >>>>>>>
> > > > >>>>>>> On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira Carvalho <
> > > > >>>>>>> [email protected]> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Hi folks,
> > > > >>>>>>>>
> > > > >>>>>>>> I would like to start a public discussion on the inclusion of
> > a
> > > > >> new
> > > > >>>> array
> > > > >>>>>>>> format to Arrow — array-view array. The name is also up for
> > > > >> debate.
> > > > >>>>>>>> This format is inspired by Velox's ArrayVector format [1].
> > > > >>> Logically,
> > > > >>>>>>> this
> > > > >>>>>>>> array represents an array of arrays. Each element is an
> > > > >> array-view
> > > > >>>>>>> (offset
> > > > >>>>>>>> and size pair) that points to a range within a nested "values"
> > > > >>> array
> > > > >>>>>>>> (called "elements" in Velox docs). The nested array can be of
> > any
> > > > >>>> type,
> > > > >>>>>>>> which makes this format very flexible and powerful.
> > > > >>>>>>>>
> > > > >>>>>>>> [image: ../_images/array-vector.png]
> > > > >>>>>>>> <
> > > > >>> https://facebookincubator.github.io/velox/_images/array-vector.png
> > >
> > > > >>>>>>>> I'm currently working on a C++ implementation and plan to work
> > > > >> on a
> > > > >>>> Go
> > > > >>>>>>>> implementation to fulfill the two-implementations requirement
> > for
> > > > >>>> format
> > > > >>>>>>>> changes.
> > > > >>>>>>>>
> > > > >>>>>>>> The draft design:
> > > > >>>>>>>>
> > > > >>>>>>>> - 3 buffers: [validity_bitmap, int32 offsets buffer, int32
> > sizes
> > > > >>>> buffer]
> > > > >>>>>>>> - 1 child array: "values" as an array of the type parameter
> > > > >>>>>>>>
> > > > >>>>>>>> validity_bitmap is used to differentiate between empty array
> > > > >> views
> > > > >>>>>>>> (sizes[i] == 0) and NULL array views (validity_bitmap[i] ==
> > 0).
> > > > >>>>>>>>
> > > > >>>>>>>> When the validity_bitmap[i] is 0, both sizes and offsets are
> > > > >>>> undefined
> > > > >>>>>>> (as
> > > > >>>>>>>> usual), and when sizes[i] == 0, offsets[i] is undefined. 0 is
> > > > >>>> recommended
> > > > >>>>>>>> if setting a value is not an issue to the system producing the
> > > > >>>> arrays.
> > > > >>>>>>>> offsets buffer is not required to be ordered and views don't
> > have
> > > > >>> to
> > > > >>>> be
> > > > >>>>>>>> disjoint.
> > > > >>>>>>>>
> > > > >>>>>>>> [1]
> > > > >>>>>>>>
> > > > >>
> > > >
> > https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> > > > >>>>>>>> Thanks,
> > > > >>>>>>>> Felipe O. Carvalho
> > > > >>>>>>>>
> > > >
> >

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Reply via email to