Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Felipe Oliveira Carvalho Wed, 14 Jun 2023 09:53:17 -0700

General approach to alternative formats aside, in the specific case of
ListView, I think the implementation complexity is being overestimated in
these discussions.


The C++ Arrow implementation shares a lot of code between List and
LargeList. And with some tweaks, I'm able to share that common
infrastructure for ListView as well. [1]

ListView is similar to list: it doesn't require offsets to be sorted and
adds an extra buffer containing sizes. For symmetry with the List and
LargeList types (FixedSizeList not included), I'm going to propose we add a
LargeListView. That is not part of the draft implementation yet, but seems
like an obvious thing to have now that I implemented the `if_else`
specialization. [2] David Li asked about this above and I can confirm now
that 64-bit version of ListView (LargeListView) is in the plans.

Trying to avoid re-implementing some kernels is not a good goal to chase,
IMO, because kernels need tweaks to take advantage of the format.

[1] https://github.com/apache/arrow/pull/35345
[2] https://github.com/felipecrv/arrow/commits/list_view_backup
--
Felipe

On Wed, Jun 14, 2023 at 12:08 PM Weston Pace <[email protected]> wrote:

> > perhaps we could support this use-case as
> > a canonical extension type over dictionary encoded, variable-sized
> > arrays
>
> I believe this suggestion is valid and could be used to solve the if-else
> case.  The algorithm, if I understand it, would be roughly:
>
> ```
> // Note: Simple pseudocode, vectorization left as exercise for the reader
> auto indices_builder = ...
> auto list_builder = ...
> indices_builder.resize(batch.length);
> Array condition_mask = condition.EvaluateBatch(batch);
> for row_index in selected_rows(condition_mask):
>   indices_builder[row_index] = list_builder.CurrentLength();
>   list_builder.Append(if_expr.EvaluateRow(batch, row_index))
> for row_index in unselected_rows(condition_mask):
>   indices_builder[row_index] = list_builder.CurrentLength();
>   list_builder.Append(else_expr.EvaluateRow(batch, row_index))
> return DictionaryArray(indices_builder.Finish(), list_builder.Finish())
> ```
>
> I also agree this is a slightly awkward use of dictionaries (e.g. the
> dictionary would have the same size as the # of indices) and perhaps not
> the most intuitive way to solve the problem.
>
> My gut reaction is simply "an improved if/else kernel is not, alone, enough
> justification for a new layout" and yet...
>
> I think we are seeing the start (I hope) of a trend where Arrow is not just
> used "between systems" (e.g. to shuttle data from one place to another, or
> between a query engine and a visualization tool) but also "within systems"
> (e.g. UDFs, bespoke file formats and temporary tables, between workers in a
> distributed query engine).  When arrow is used "within systems" I think
> both the number of bespoke formats and the significance of conversion cost
> increases.  For example, it's easy to say that Velox should convert at the
> boundary as data leaves Velox.  But what if Velox (or datafusion or ...)
> were to define an interface for UDFs.  Would we want to use Arrow there
> (e.g. the C data interface is a good fit)?  If so, wouldn't the conversion
> cost be more significant?
>
> > Also, I'm very lukewarm towards the concept of "alternative layouts"
> > suggested somewhere else in this thread. It does not seem a good choice
> > to complexify the Arrow format that much.
>
> I think, in my opinion, this depends on how many of these alternative
> layouts exist.  If there are just a few, then I agree, we should just adopt
> them as formal first-class layouts.  If there are many, then I think it
> will be too much complexity in Arrow to have all the different choices.
> Or, we could say there are many, but the alternatives don't belong in Arrow
> at all.  In that case I think it's the same question as the above
> paragraph, "do we want Arrow to be used within systems?  Or just between
> systems?"
>
>
> On Wed, Jun 14, 2023 at 2:07 AM Antoine Pitrou <[email protected]> wrote:
>
> >
> > I agree that ListView cannot be an extension type, given that it
> > features a new layout, and therefore cannot reasonably be backed by an
> > existing storage type (AFAICT).
> >
> > Also, I'm very lukewarm towards the concept of "alternative layouts"
> > suggested somewhere else in this thread. It does not seem a good choice
> > to complexify the Arrow format that much.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 07/06/2023 à 00:21, Felipe Oliveira Carvalho a écrit :
> > > +1 on what Ian said.
> > >
> > > And as I write kernels for this new format, I’m learning that it’s
> > possible
> > > to re-use the common infrastructure used by List and LargeList to
> > implement
> > > the ListView related features with some adjustments.
> > >
> > > IMO having this format as a second-class citizen would more likely
> > > complicate things because it would make this unification harder.
> > >
> > > —
> > > Felipe
> > >
> > > On Tue, 6 Jun 2023 at 18:45 Ian Cook <[email protected]> wrote:
> > >
> > >> To clarify why we cannot simply propose adding ListView as a new
> > >> “canonical extension type”: The extension type mechanism in Arrow
> > >> depends on the underlying data being organized in an existing Arrow
> > >> layout—that way an implementation that does not support the extension
> > >> type can still handle the underlying data. But ListView is a wholly
> > >> new layout.
> > >>
> > >> I strongly agree with Weston’s idea that it is a good time for Arrow
> > >> to introduce the notion of “canonical alternative layouts.”
> > >>
> > >> Taken together, I think that canonical extension types and canonical
> > >> alternative layouts could serve as an “incubator” for proposed new
> > >> representations. For example, if a proposed canonical alternative
> > >> layout ends up being broadly adopted, then that will serve as a signal
> > >> that we should consider adding it as a primary layout in the core
> > >> spec.
> > >>
> > >> It seems to me that most projects that are implementing Arrow today
> > >> are not aiming to provide complete coverage of Arrow; rather they are
> > >> adopting Arrow because of its role as a standard and they are
> > >> implementing only as much of the Arrow standard as they require to
> > >> achieve some goal. I believe that such projects are important Arrow
> > >> stakeholders, and I believe that this proposed notion of canonical
> > >> alternative layouts will serve them well and will create efficiencies
> > >> by standardizing implementations around a shared set of alternatives.
> > >>
> > >> However I think that the documentation for canonical alternative
> > >> layouts should strongly encourage implementers to default to using the
> > >> primary layouts defined in the core spec and only use alternative
> > >> layouts in cases where the primary layouts do not meet their needs.
> > >>
> > >>
> > >> On Sat, May 27, 2023 at 7:44 PM Micah Kornfield <
> [email protected]>
> > >> wrote:
> > >>>
> > >>> This sounds reasonable to me but my main concern is, I'm not sure
> there
> > >> is
> > >>> a great mechanism to enforce canonical layouts don't somehow become
> > >> default
> > >>> (or the only implementation).
> > >>>
> > >>> Even for these new layouts, I think it might be worth rethinking
> > binding
> > >> a
> > >>> layout into the schema versus having a different concept of encoding
> > (and
> > >>> changing some of the corresponding data structures).
> > >>>
> > >>>
> > >>> On Mon, May 22, 2023 at 10:37 AM Weston Pace <[email protected]>
> > >> wrote:
> > >>>
> > >>>> Trying to settle on one option is a fruitless endeavor.  Each type
> has
> > >> pros
> > >>>> and cons.  I would also predict that the largest existing usage of
> > >> Arrow is
> > >>>> shuttling data from one system to another.  The newly proposed
> format
> > >>>> doesn't appear to have any significant advantage for that use case
> (if
> > >>>> anything, the existing format is arguably better as it is more
> > >> compact).
> > >>>>
> > >>>> I am very biased towards historical precedent and avoiding breaking
> > >>>> changes.
> > >>>>
> > >>>> We have "canonical extension types", perhaps it is time for
> "canonical
> > >>>> alternative layouts".  We could define it as such:
> > >>>>
> > >>>>   * There are one or more primary layouts
> > >>>>     * Existing layouts are automatically considered primary layouts,
> > >> even if
> > >>>> they wouldn't
> > >>>>       have been primary layouts initially (e.g. large list)
> > >>>>   * A new layout, if it is semantically equivalent to another, is
> > >> considered
> > >>>> an alternative layout
> > >>>>   * An alternative layout still has the same requirements for
> adoption
> > >> (two
> > >>>> implementations and a vote)
> > >>>>     * An implementation should not feel pressured to rush and
> > implement
> > >> the
> > >>>> new layout.
> > >>>>       It would be good if they contribute in the discussion and
> > >> consider the
> > >>>> layout and vote
> > >>>>       if they feel it would be an acceptable design.
> > >>>>   * We can define and vote and approve as many canonical alternative
> > >> layouts
> > >>>> as we want:
> > >>>>     * A canonical alternative layout should, at a minimum, have some
> > >>>>       reasonable justification, such as improved performance for
> > >> algorithm X
> > >>>>   * Arrow implementations MUST support the primary layouts
> > >>>>   * An Arrow implementation MAY support a canonical alternative,
> > >> however:
> > >>>>     * An Arrow implementation MUST first support the primary layout
> > >>>>     * An Arrow implementation MUST support conversion to/from the
> > >> primary
> > >>>> and canonical layout
> > >>>>     * An Arrow implementation's APIs MUST only provide data in the
> > >>>> alternative
> > >>>>       layout if it is explicitly asked for (e.g. schema inference
> > should
> > >>>> prefer the primary layout).
> > >>>>   * We can still vote for new primary layouts (e.g. promoting a
> > >> canonical
> > >>>> alternative) but, in these
> > >>>>      votes we don't only consider the value (e.g. performance) of
> the
> > >> layout
> > >>>> but also the interoperability.
> > >>>>      In other words, a layout can only become a primary layout if
> > there
> > >> is
> > >>>> significant evidence that most
> > >>>>      implementations plan to adopt it.
> > >>>>
> > >>>> This lets us evolve support for new layouts more naturally.  We can
> > >>>> generally assume that users will not, initially, be aware of these
> > >>>> alternative layouts.  However, everything will just work.  They may
> > >> start
> > >>>> to see a performance penalty stemming from a lack of support for
> these
> > >>>> layouts.  If this performance penalty becomes significant then they
> > >> will
> > >>>> discover it and become aware of the problem.  They can then ask
> > >> whatever
> > >>>> library they are using to add support for the alternative layout.
> As
> > >>>> enough users find a need for it then libraries will add support.
> > >>>> Eventually, enough libraries will support it that we can adopt it
> as a
> > >>>> primary layout.
> > >>>>
> > >>>> Also, it allows libraries to adopt alternative layouts more
> > >> aggressively if
> > >>>> they would like while still hopefully ensuring that we eventually
> all
> > >>>> converge on the same implementation of the alternative layout.
> > >>>>
> > >>>> On Mon, May 22, 2023 at 9:35 AM Will Jones <[email protected]
> >
> > >>>> wrote:
> > >>>>
> > >>>>> Hello Arrow devs,
> > >>>>>
> > >>>>> I don't understand why we would start deprecating features in the
> > >> Arrow
> > >>>>>> format. Even starting this talk might already be a bad idea
> > >> PR-wise.
> > >>>>>>
> > >>>>>
> > >>>>> I agree we don't want to make breaking changes to the Arrow format.
> > >> But
> > >>>>> several maintainers have already stated they have no interest in
> > >>>>> maintaining both list types with full compute functionality [1][2],
> > >> so I
> > >>>>> think it's very likely one list type or the other will be
> > >>>>> implicitly preferred in the ecosystem if this data type was added.
> > >> If
> > >>>>> that's the case, I'd prefer that we agreed as a community which one
> > >>>> should
> > >>>>> be preferred. Maybe that's not the best path; it's just one way for
> > >> us to
> > >>>>> balance stability, maintenance burden, and relevance.
> > >>>>>
> > >>>>> Can someone help distill down the primary rationale and usecase for
> > >>>>>> adding ArrayView to the Arrow Spec?
> > >>>>>>
> > >>>>>
> > >>>>> Looking back at that old thread, I think one of the main
> motivations
> > >> is
> > >>>> to
> > >>>>> try to prevent query engine implementers from feeling there is a
> > >> tradeoff
> > >>>>> between having state-of-the-art performance and being Arrow-native.
> > >> For
> > >>>>> some of the new array types, we had both Velox and DuckDB use them,
> > >> so it
> > >>>>> was reasonable to expect they were innovations that might
> > >> proliferate.
> > >>>> I'm
> > >>>>> not sure if the ArrayView is part of that. From Wes earlier [3]:
> > >>>>>
> > >>>>> The idea is that in a world of data and query federation (for
> > >> example,
> > >>>>>> consider [1] where Arrow is being used as a data federation layer
> > >> with
> > >>>>> many
> > >>>>>> query engines), we want to increase the amount of data in-flight
> > >> and
> > >>>>>> in-memory that is in Arrow format. So if query engines are having
> > >> to
> > >>>>> depart
> > >>>>>> substantially from the Arrow format to get performance, then this
> > >>>>> creates a
> > >>>>>> potential lose-lose situation: * Depart from Arrow: get better
> > >>>>> performance
> > >>>>>> but pay serialization costs to read and write Arrow (the
> > >> performance
> > >>>> and
> > >>>>>> resource utilization benefits outweigh the serialization costs).
> > >> This
> > >>>>> puts
> > >>>>>> additional pressure on query engines to build specialized
> > >> components
> > >>>> for
> > >>>>>> solving problems rather than making use of off-the-shelf
> components
> > >>>> that
> > >>>>>> use Arrow. This has knock-on effects on ecosystem fragmentation. *
> > >> Or
> > >>>> use
> > >>>>>> Arrow, and accept suboptimal query processing performance
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>> Will mentions one usecase is Velox consuming python UDF output,
> which
> > >>>> seems
> > >>>>>> to be mostly about how fast Velox can consume this format, not how
> > >> fast
> > >>>>> it
> > >>>>>> can be written. Are there other usecases?
> > >>>>>>
> > >>>>>
> > >>>>> To be clear, I don't know if that's the use case they want. That's
> > >> just
> > >>>> me
> > >>>>> speculating.
> > >>>>>
> > >>>>> I still have some questions myself:
> > >>>>>
> > >>>>> 1. Is this array type currently only used in Velox? (not DuckDB
> like
> > >> some
> > >>>>> of the other new types?) What evidence do we have that it will
> become
> > >>>> used
> > >>>>> outside of Velox?
> > >>>>> 2. We already have three list types: list, large list (64-bit
> > >> offsets),
> > >>>> and
> > >>>>> fixed size list. Do we think we will only want a view version of
> the
> > >>>> 32-bit
> > >>>>> offset variable length list? Or are we potentially talking about
> view
> > >>>>> variants for all three?
> > >>>>>
> > >>>>> Best,
> > >>>>>
> > >>>>> Will Jones
> > >>>>>
> > >>>>> [1]
> https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx
> > >>>>> [2]
> https://lists.apache.org/thread/cc4w3vs3foj1fmpq9x888k51so60ftr3
> > >>>>> [3]
> https://lists.apache.org/thread/mk2yn62y6l8qtngcs1vg2qtwlxzbrt8t
> > >>>>>
> > >>>>> On Mon, May 22, 2023 at 3:48 AM Andrew Lamb <[email protected]>
> > >>>> wrote:
> > >>>>>
> > >>>>>> Can someone help distill down the primary rationale and usecase
> for
> > >>>>>> adding ArrayView to the Arrow Spec?
> > >>>>>>
> > >>>>>>  From the above discussions, the stated rationale seems to be fast
> > >>>>>> (zero-copy) interchange with Velox.
> > >>>>>>
> > >>>>>> This thread has qualitatively enumerated the benefits of
> > >> (offset+len)
> > >>>>>> encoding over the existing Arrow ListArray (offets) approach, but
> I
> > >>>>> haven't
> > >>>>>> seen any performance measurements that might help us to gauge the
> > >>>>> tradeoff
> > >>>>>> in additional complexity vs runtime overhead.
> > >>>>>>
> > >>>>>> Will mentions one usecase is Velox consuming python UDF output,
> > >> which
> > >>>>> seems
> > >>>>>> to be mostly about how fast Velox can consume this format, not how
> > >> fast
> > >>>>> it
> > >>>>>> can be written. Are there other usecases?
> > >>>>>>
> > >>>>>> Do we have numbers showing how much overhead converting to /from
> > >>>> Velox's
> > >>>>>> internal representation and the existing ListArray adds? Has
> > >> anyone in
> > >>>>>> Velox land considered adding faster support for Arrow style
> > >> ListArray
> > >>>>>> encoding?
> > >>>>>>
> > >>>>>>
> > >>>>>> Andrew
> > >>>>>>
> > >>>>>> On Mon, May 22, 2023 at 4:38 AM Antoine Pitrou <
> [email protected]
> > >>>
> > >>>>> wrote:
> > >>>>>>
> > >>>>>>>
> > >>>>>>> Hi,
> > >>>>>>>
> > >>>>>>> I don't understand why we would start deprecating features in the
> > >>>> Arrow
> > >>>>>>> format. Even starting this talk might already be a bad idea
> > >> PR-wise.
> > >>>>>>>
> > >>>>>>> As for implementing conversions at the I/O boundary, it's a
> > >>>> reasonably
> > >>>>>>> policy, but it still requires work by implementors and it's not
> > >>>> granted
> > >>>>>>> that all consumers of the Arrow format will grow such conversions
> > >>>>>>> if/when we add non-trivial types such as ListView or StringView.
> > >>>>>>>
> > >>>>>>> Regards
> > >>>>>>>
> > >>>>>>> Antoine.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Le 22/05/2023 à 00:39, Will Jones a écrit :
> > >>>>>>>> One more thing: Looking back on the previous discussion[1]
> > >> (which
> > >>>>>> Weston
> > >>>>>>>> pointed out in his earlier message), Jorge suggested that the
> > >> old
> > >>>>> list
> > >>>>>>>> types might be deprecated in favor of view variants [2]. Others
> > >>>> were
> > >>>>>>>> worried that it might undermine the perception that the Arrow
> > >>>> format
> > >>>>> is
> > >>>>>>>> stable. I think it might be worth thinking about "soft
> > >> deprecating"
> > >>>>> the
> > >>>>>>> old
> > >>>>>>>> list type: suggesting new implementations prefer the list
> > >> view, but
> > >>>>>>>> reassuring that implementations should support the old format,
> > >> even
> > >>>>> if
> > >>>>>>> they
> > >>>>>>>> just convert to the new format. To be clear, this wouldn't
> > >> mean we
> > >>>>> plan
> > >>>>>>> to
> > >>>>>>>> create breaking changes in the format; but if we ever did for
> > >> other
> > >>>>>>>> reasons, the old list type might go.
> > >>>>>>>>
> > >>>>>>>> Arrow compute libraries could choose either format for compute
> > >>>>> support,
> > >>>>>>> and
> > >>>>>>>> plan to do conversion at the boundaries. Libraries that use
> > >> the new
> > >>>>>> type
> > >>>>>>>> will have cheap conversion when reading the old type. Meanwhile
> > >>>> those
> > >>>>>>> that
> > >>>>>>>> are still on the old type will have some incentive to move
> > >> towards
> > >>>>> the
> > >>>>>>> new
> > >>>>>>>> one, since that conversion will not be as efficient.
> > >>>>>>>>
> > >>>>>>>> [1]
> > >>>> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > >>>>>>>> [2]
> > >>>> https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx
> > >>>>>>>>
> > >>>>>>>> On Sun, May 21, 2023 at 3:07 PM Will Jones <
> > >>>> [email protected]>
> > >>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hello,
> > >>>>>>>>>
> > >>>>>>>>> I think Sasha brings up a good point, that the advantages of
> > >> this
> > >>>>>> format
> > >>>>>>>>> seem to be primarily about query processing. Other encodings
> > >> like
> > >>>>> REE
> > >>>>>>> and
> > >>>>>>>>> dictionary have space-saving advantages that justify them
> > >> simply
> > >>>> in
> > >>>>>>> terms
> > >>>>>>>>> of space efficiency (although they have query processing
> > >>>> advantages
> > >>>>> as
> > >>>>>>>>> well). As discussed, most use cases are already well served by
> > >>>>>> existing
> > >>>>>>>>> list types and dictionary encoding.
> > >>>>>>>>>
> > >>>>>>>>> I agree that there are cases where transferring this type
> > >> without
> > >>>>>>>>> conversion would be ideal. One use case I can think of is if
> > >> Velox
> > >>>>>>> wants to
> > >>>>>>>>> be able to take Arrow-based UDFs (possibly written with
> > >> PyArrow,
> > >>>> for
> > >>>>>>>>> example) that operate on this column type and therefore wants
> > >>>>>> zero-copy
> > >>>>>>>>> exchange over the C Data Interface.
> > >>>>>>>>>
> > >>>>>>>>> One big question I have: we already have three list types:
> > >> list,
> > >>>>> large
> > >>>>>>>>> list (64-bit offsets), and fixed size list. Do we think we
> > >> will
> > >>>> only
> > >>>>>>> want a
> > >>>>>>>>> view version of the 32-bit offset variable length list? Or
> > >> are we
> > >>>>>>>>> potentially talking about view variants for all three?
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>>
> > >>>>>>>>> Will Jones
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Sun, May 21, 2023 at 2:19 PM Felipe Oliveira Carvalho <
> > >>>>>>>>> [email protected]> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> The benefit of having a memory format that’s friendly to
> > >>>>>>> non-deterministic
> > >>>>>>>>>> order writes is unlocked by the transport and processing of
> > >> the
> > >>>>> data
> > >>>>>>> being
> > >>>>>>>>>> agnostic to the physical order as much as possible.
> > >>>>>>>>>>
> > >>>>>>>>>> Requiring a conversion could cancel out that benefit. But it
> > >> can
> > >>>>> be a
> > >>>>>>>>>> provisory step for compatibility between systems that don’t
> > >>>>>> understand
> > >>>>>>> the
> > >>>>>>>>>> format yet. This is similar to the situation with compression
> > >>>>> schemes
> > >>>>>>> like
> > >>>>>>>>>> run-end encoding — the goal is processing the compressed data
> > >>>>>> directly
> > >>>>>>>>>> without an expansion step whenever possible.
> > >>>>>>>>>>
> > >>>>>>>>>> This is why having it as part of the open Arrow format is so
> > >>>>>> important:
> > >>>>>>>>>> everyone can agree on a format that’s friendly to parallel
> > >> and/or
> > >>>>>>>>>> vectorized compute kernels without introducing multiple
> > >>>>> incompatible
> > >>>>>>>>>> formats to the ecosystem and without imposing a conversion
> > >> step
> > >>>>>> between
> > >>>>>>>>>> the
> > >>>>>>>>>> different systems.
> > >>>>>>>>>>
> > >>>>>>>>>> —
> > >>>>>>>>>> Felipe
> > >>>>>>>>>>
> > >>>>>>>>>> On Sat, 20 May 2023 at 20:04 Aldrin
> > >> <[email protected]>
> > >>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> I don't feel like this representation is necessarily a
> > >> detail of
> > >>>>> the
> > >>>>>>>>>> query
> > >>>>>>>>>>> engine, but I am also not sure why this representation would
> > >>>> have
> > >>>>> to
> > >>>>>>> be
> > >>>>>>>>>>> converted to a non-view format when serializing. Could you
> > >>>> clarify
> > >>>>>>>>>> that? My
> > >>>>>>>>>>> impression is that this representation could be used for
> > >>>>> persistence
> > >>>>>>> or
> > >>>>>>>>>>> data transfer, though it can be more complex to guarantee
> > >> the
> > >>>>>> portion
> > >>>>>>> of
> > >>>>>>>>>>> the buffer that an index points to is also present in
> > >> memory.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Sent from Proton Mail for iOS
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <
> > >>>>>>>>>> [email protected]
> > >>>>>>>>>>>
> > >> <On+Sat,+May+20,+2023+at+15:00,+Sasha+Krassovsky+%3C%3Ca+href=>>
> > >>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi everyone,
> > >>>>>>>>>>> I understand that there are numerous benefits to this
> > >>>>> representation
> > >>>>>>>>>>> during query processing, but would it be fair to say that
> > >> this
> > >>>> is
> > >>>>> an
> > >>>>>>>>>>> implementation detail of the query engine? Query engines
> > >> don’t
> > >>>>>>>>>> necessarily
> > >>>>>>>>>>> need to conform to the Arrow format internally, only at
> > >>>>>> ingest/egress
> > >>>>>>>>>>> points, and performing a conversion from the non-view to
> > >> view
> > >>>>> format
> > >>>>>>>>>> seems
> > >>>>>>>>>>> like it would be very cheap (though I understand not
> > >> necessarily
> > >>>>> the
> > >>>>>>>>>> other
> > >>>>>>>>>>> way around, but you’d need to do that anyway if you’re
> > >>>>> serializing).
> > >>>>>>>>>>>
> > >>>>>>>>>>> Sasha Krassovsky
> > >>>>>>>>>>>
> > >>>>>>>>>>>> 20 мая 2023 г., в 13:00, Will Jones <
> > >> [email protected]>
> > >>>>>>>>>>> написал(а):
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks for sharing these details, Pedro. The conditional
> > >>>>> branches
> > >>>>>>>>>>> argument
> > >>>>>>>>>>>> makes a lot of sense to me.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The tensors point brings up some interesting issues. For
> > >> now,
> > >>>>> we've
> > >>>>>>>>>>> defined
> > >>>>>>>>>>>> our only tensor extension type to be built on a fixed size
> > >>>> list.
> > >>>>>> If a
> > >>>>>>>>>> use
> > >>>>>>>>>>>> case of this might be manipulating tensors with zero copy,
> > >>>>> perhaps
> > >>>>>>>>>> that
> > >>>>>>>>>>>> suggests that we want a fixed size list variant? In
> > >> addition,
> > >>>>> would
> > >>>>>>> we
> > >>>>>>>>>>> have
> > >>>>>>>>>>>> to define another extension type to be a ListView variant?
> > >> Or
> > >>>>> would
> > >>>>>>> we
> > >>>>>>>>>>> want
> > >>>>>>>>>>>> to think about making extension types somehow valid across
> > >>>>> various
> > >>>>>>>>>>>> encodings of the same "logical type"?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha
> > >> Pedreira
> > >>>>>>>>>>>>> <[email protected]> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> This is Pedro from the Velox team at Meta. This is my
> > >> first
> > >>>> time
> > >>>>>>>>>> here,
> > >>>>>>>>>>> so
> > >>>>>>>>>>>>> nice to e-meet you!
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Adding to what Felipe said, the main reason we created
> > >>>>> “ListView”
> > >>>>>>>>>>> (though
> > >>>>>>>>>>>>> we just call them ArrayVector/MapVector in Velox) is that,
> > >>>> along
> > >>>>>>> with
> > >>>>>>>>>>>>> StringViews for strings, they allow us to write any
> > >> columnar
> > >>>>>> buffer
> > >>>>>>>>>>>>> out-or-order, regardless of their types or encodings.
> > >> This is
> > >>>>>>>>>> naturally
> > >>>>>>>>>>>>> doable for all primitive types (fixed-size), but not for
> > >> types
> > >>>>>> that
> > >>>>>>>>>>> don’t
> > >>>>>>>>>>>>> have fixed size and are required to be contiguous. The
> > >>>>> StringView
> > >>>>>>> and
> > >>>>>>>>>>>>> ListView formats allow us to keep this invariant in Velox.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Being able to write vectors out-of-order is useful when
> > >>>>> executing
> > >>>>>>>>>>>>> conditionals like IF/SWITCH statements, which are
> > >> pervasive
> > >>>>> among
> > >>>>>>> our
> > >>>>>>>>>>>>> workloads. To fully vectorize it, one first evaluates the
> > >>>>>>> expression,
> > >>>>>>>>>>> then
> > >>>>>>>>>>>>> generate a bitmap containing which rows take the THEN and
> > >>>> which
> > >>>>>> take
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> ELSE branch. Then you populate all rows that match the
> > >> first
> > >>>>>> branch
> > >>>>>>>>>> by
> > >>>>>>>>>>>>> evaluating the THEN expression in a vectorized
> > >> (branch-less
> > >>>> and
> > >>>>>>> cache
> > >>>>>>>>>>>>> friendly) way, and subsequently the ELSE branch. If you
> > >> can’t
> > >>>>>> write
> > >>>>>>>>>> them
> > >>>>>>>>>>>>> out-of-order, you would either have a big branch per row
> > >>>>>> dispatching
> > >>>>>>>>>> to
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>> right expression (slow), or populate two distinct vectors
> > >> then
> > >>>>>>>>>> merging
> > >>>>>>>>>>> them
> > >>>>>>>>>>>>> at the end (potentially even slower). How much faster our
> > >>>>> approach
> > >>>>>>> is
> > >>>>>>>>>>>>> highly depends on the buffer sizes and expressions, but we
> > >>>> found
> > >>>>>> it
> > >>>>>>>>>> to
> > >>>>>>>>>>> be
> > >>>>>>>>>>>>> faster enough on average to justify us extending the
> > >>>> underlying
> > >>>>>>>>>> layout.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> With that said, this is all within a single thread of
> > >>>> execution.
> > >>>>>>>>>>>>> Parallelization is done by feeding each thread its own
> > >>>>>> vector/data.
> > >>>>>>>>>> As
> > >>>>>>>>>>>>> pointed out in a previous message, this also gives you the
> > >>>>>>>>>> flexibility
> > >>>>>>>>>>> to
> > >>>>>>>>>>>>> implement cardinality increasing/reducing operations, but
> > >> we
> > >>>>> don’t
> > >>>>>>>>>> use
> > >>>>>>>>>>> it
> > >>>>>>>>>>>>> for that purpose. Operations like filtering, joining,
> > >>>> unnesting
> > >>>>>> and
> > >>>>>>>>>>> similar
> > >>>>>>>>>>>>> are done by wrapping the internal vector in a dictionary,
> > >> as
> > >>>>> these
> > >>>>>>>>>> need
> > >>>>>>>>>>> to
> > >>>>>>>>>>>>> work not only on “ListViews” but on any data types with
> > >> any
> > >>>>>>> encoding.
> > >>>>>>>>>>> There
> > >>>>>>>>>>>>> are more details on Section 4.2.1 in [1]
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Beyond this, it also gives function/kernel developers more
> > >>>>>>>>>> flexibility
> > >>>>>>>>>>> to
> > >>>>>>>>>>>>> implement operations that manipulate Arrays/Maps. For
> > >> example,
> > >>>>>>>>>>> operations
> > >>>>>>>>>>>>> that slice these containers can be implemented in a
> > >> zero-copy
> > >>>>>> manner
> > >>>>>>>>>> by
> > >>>>>>>>>>>>> just rearranging the lengths/offsets indices, without ever
> > >>>>>> touching
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> larger internal buffers. This is a similar motivation as
> > >> for
> > >>>>>>>>>> StringView
> > >>>>>>>>>>>>> (think of substr(), trim(), and similar). One nice last
> > >>>> property
> > >>>>>> is
> > >>>>>>>>>> that
> > >>>>>>>>>>>>> this layout allows for overlapping ranges. This is
> > >> something
> > >>>>>>>>>> discussed
> > >>>>>>>>>>> with
> > >>>>>>>>>>>>> our ML people to allow deduping feature values in a tensor
> > >>>>> (which
> > >>>>>> is
> > >>>>>>>>>>> fairly
> > >>>>>>>>>>>>> common), but not something we have leveraged just yet.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> [1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>> --
> > >>>>>>>>>>>>> Pedro Pedreira
> > >>>>>>>>>>>>> ________________________________
> > >>>>>>>>>>>>> From: Felipe Oliveira Carvalho <[email protected]>
> > >>>>>>>>>>>>> Sent: Friday, May 19, 2023 10:01 AM
> > >>>>>>>>>>>>> To: [email protected] <[email protected]>
> > >>>>>>>>>>>>> Cc: Pedro Eugenio Rocha Pedreira <[email protected]>
> > >>>>>>>>>>>>> Subject: Re: [DISCUSS][Format] Starting the draft
> > >>>> implementation
> > >>>>>> of
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> ArrayView array format
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> +pedroerp On Thu, 11 May 2023 at 17: 51 Raphael
> > >> Taylor-Davies
> > >>>>> <r.
> > >>>>>>>>>>>>> taylordavies@ googlemail. com. invalid> wrote: Hi All, >
> > >> if
> > >>>> we
> > >>>>>>> added
> > >>>>>>>>>>>>> this, do we think many Arrow and query > engine
> > >>>> implementations
> > >>>>>> (for
> > >>>>>>>>>>>>> example, DataFusion) will be
> > >>>>>>>>>>>>> ZjQcmQRYFpfptBannerStart
> > >>>>>>>>>>>>> This Message Is From an External Sender
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> ZjQcmQRYFpfptBannerEnd
> > >>>>>>>>>>>>> +pedroerp
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies
> > >>>>>>>>>>>>> <[email protected]> wrote:
> > >>>>>>>>>>>>> Hi All,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> if we added this, do we think many Arrow and query
> > >>>>>>>>>>>>>> engine implementations (for example, DataFusion) will be
> > >>>> eager
> > >>>>> to
> > >>>>>>>>>> add
> > >>>>>>>>>>>>> full
> > >>>>>>>>>>>>>> support for the type, including compute kernels? Or are
> > >> they
> > >>>>>> likely
> > >>>>>>>>>> to
> > >>>>>>>>>>>>> just
> > >>>>>>>>>>>>>> convert this type to ListArray at import boundaries?
> > >>>>>>>>>>>>> I can't speak for query engines in general, but at least
> > >> for
> > >>>>>>> arrow-rs
> > >>>>>>>>>>>>> and by extension DataFusion, and based on my current
> > >>>>> understanding
> > >>>>>>> of
> > >>>>>>>>>>>>> the use-cases I would be rather hesitant to add support
> > >> to the
> > >>>>>>>>>> kernels
> > >>>>>>>>>>>>> for this array type, definitely instead favouring
> > >> conversion
> > >>>> at
> > >>>>>> the
> > >>>>>>>>>>>>> edges. We already have issues with the amount of code
> > >>>> generation
> > >>>>>>>>>>>>> resulting in binary bloat and long compile times, and I
> > >> worry
> > >>>>> this
> > >>>>>>>>>> would
> > >>>>>>>>>>>>> worsen this situation whilst not really providing
> > >> compelling
> > >>>>>>>>>> advantages
> > >>>>>>>>>>>>> for the vast majority of workloads that don't interact
> > >> with
> > >>>>> Velox.
> > >>>>>>>>>>>>> Whilst I can definitely see that the ListView
> > >> representation
> > >>>> is
> > >>>>>>>>>> probably
> > >>>>>>>>>>>>> a better way to represent variable length lists than what
> > >>>> arrow
> > >>>>>>>>>> settled
> > >>>>>>>>>>>>> upon, I'm not yet convinced it is sufficiently better to
> > >>>>>> incentivise
> > >>>>>>>>>>>>> broad ecosystem adoption.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Kind Regards,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Raphael Taylor-Davies
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On 11/05/2023 21:20, Will Jones wrote:
> > >>>>>>>>>>>>>> Hi Felipe,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks for the additional details.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Velox kernels benefit from being able to append data to
> > >> the
> > >>>>>> array
> > >>>>>>>>>> from
> > >>>>>>>>>>>>>>> different threads without care for strict ordering.
> > >> Only the
> > >>>>>>>>>> offsets
> > >>>>>>>>>>>>> array
> > >>>>>>>>>>>>>>> has to be written according to logical order but that is
> > >>>>>>>>>> potentially a
> > >>>>>>>>>>>>> much
> > >>>>>>>>>>>>>>> smaller buffer than the values buffer.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> It still seems to me like applications are still pretty
> > >>>> niche,
> > >>>>>> as I
> > >>>>>>>>>>>>> suspect
> > >>>>>>>>>>>>>> in most cases the benefits are outweighed by the costs.
> > >> The
> > >>>>>> benefit
> > >>>>>>>>>>> here
> > >>>>>>>>>>>>>> seems pretty limited: if you are trying to split work
> > >> between
> > >>>>>>>>>> threads,
> > >>>>>>>>>>>>>> usually you will have other levels such as array chunks
> > >> to
> > >>>>>>>>>> parallelize.
> > >>>>>>>>>>>>> And
> > >>>>>>>>>>>>>> if you have an incoming stream of row data, you'll want
> > >> to
> > >>>>> append
> > >>>>>>> in
> > >>>>>>>>>>>>>> predictable order to match the order of the other
> > >> arrays. Am
> > >>>> I
> > >>>>>>>>>> missing
> > >>>>>>>>>>>>>> something?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> And, IIUC, the cost of using ListView with out-of-order
> > >>>> values
> > >>>>>> over
> > >>>>>>>>>>>>>> ListArray is you lose memory locality; the values of
> > >> element
> > >>>> 2
> > >>>>>> are
> > >>>>>>>>>> no
> > >>>>>>>>>>>>>> longer adjacent to the values of element 1. What do you
> > >> think
> > >>>>>> about
> > >>>>>>>>>>> that
> > >>>>>>>>>>>>>> tradeoff?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I don't mean to be difficult about this. I'm excited for
> > >> both
> > >>>>> the
> > >>>>>>>>>> REE
> > >>>>>>>>>>> and
> > >>>>>>>>>>>>>> StringView arrays, but this one I'm not so sure about
> > >> yet. I
> > >>>>>>> suppose
> > >>>>>>>>>>>>> what I
> > >>>>>>>>>>>>>> am trying to ask is, if we added this, do we think many
> > >> Arrow
> > >>>>> and
> > >>>>>>>>>> query
> > >>>>>>>>>>>>>> engine implementations (for example, DataFusion) will be
> > >>>> eager
> > >>>>> to
> > >>>>>>>>>> add
> > >>>>>>>>>>>>> full
> > >>>>>>>>>>>>>> support for the type, including compute kernels? Or are
> > >> they
> > >>>>>> likely
> > >>>>>>>>>> to
> > >>>>>>>>>>>>> just
> > >>>>>>>>>>>>>> convert this type to ListArray at import boundaries?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Because if it turns out to be the latter, then we might
> > >> as
> > >>>> well
> > >>>>>> ask
> > >>>>>>>>>>> Velox
> > >>>>>>>>>>>>>> to export this type as ListArray and save the rest of the
> > >>>>>> ecosystem
> > >>>>>>>>>>> some
> > >>>>>>>>>>>>>> work.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Will Jones
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira
> > >> Carvalho <
> > >>>>>>>>>>>>>> [email protected]<mailto:[email protected]>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Initial reason for ListView arrays in Arrow is zero-copy
> > >>>>>>>>>> compatibility
> > >>>>>>>>>>>>> with
> > >>>>>>>>>>>>>>> Velox which uses this format.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Velox kernels benefit from being able to append data to
> > >> the
> > >>>>>> array
> > >>>>>>>>>> from
> > >>>>>>>>>>>>>>> different threads without care for strict ordering.
> > >> Only the
> > >>>>>>>>>> offsets
> > >>>>>>>>>>>>> array
> > >>>>>>>>>>>>>>> has to be written according to logical order but that is
> > >>>>>>>>>> potentially a
> > >>>>>>>>>>>>> much
> > >>>>>>>>>>>>>>> smaller buffer than the values buffer.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Acero kernels could take advantage of that in the
> > >> future.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> In implementing ListViewArray/Type I was able to reuse
> > >> some
> > >>>>> C++
> > >>>>>>>>>>>>> templates
> > >>>>>>>>>>>>>>> used for ListArray which can reduce some of the burden
> > >> on
> > >>>>> kernel
> > >>>>>>>>>>>>>>> implementations that aim to work with all the types.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I’m can fix Acero kernels for working with ListView.
> > >> This is
> > >>>>>>>>>> similar
> > >>>>>>>>>>> to
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>> work I’ve doing in kernels dealing with run-end encoded
> > >>>>> arrays.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> —
> > >>>>>>>>>>>>>>> Felipe
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Wed, 26 Apr 2023 at 01:03 Will Jones <
> > >>>>>> [email protected]
> > >>>>>>>>>>>>> <mailto:[email protected]>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I suppose one common use case is materializing list
> > >> columns
> > >>>>>> after
> > >>>>>>>>>>> some
> > >>>>>>>>>>>>>>>> expanding operation like a join or unnest. That's a
> > >> case
> > >>>>> where
> > >>>>>> I
> > >>>>>>>>>>> could
> > >>>>>>>>>>>>>>>> imagine a lot of repetition of values. Haven't yet
> > >> thought
> > >>>> of
> > >>>>>>>>>> common
> > >>>>>>>>>>>>>>> cases
> > >>>>>>>>>>>>>>>> where there is overlap but not full duplication, but am
> > >>>> eager
> > >>>>>> to
> > >>>>>>>>>> hear
> > >>>>>>>>>>>>>>> any.
> > >>>>>>>>>>>>>>>> The dictionary encoding point Raphael makes is
> > >> interesting,
> > >>>>>>>>>>> especially
> > >>>>>>>>>>>>>>>> given the existence of LargeList and FixedSizeList. For
> > >>>> many
> > >>>>>>>>>>>>> operations,
> > >>>>>>>>>>>>>>> it
> > >>>>>>>>>>>>>>>> might make more sense to just compose those existing
> > >> types.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> IIUC the operations that would be unique to the
> > >> ArrayView
> > >>>> are
> > >>>>>>> ones
> > >>>>>>>>>>>>>>> altering
> > >>>>>>>>>>>>>>>> the shape. One could truncate each array to a certain
> > >>>> length
> > >>>>>>>>>> cheaply
> > >>>>>>>>>>>>>>> simply
> > >>>>>>>>>>>>>>>> by replacing the sizes buffer. Or perhaps there are
> > >>>>> interesting
> > >>>>>>>>>>>>>>> operations
> > >>>>>>>>>>>>>>>> on tensors that would benefit.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 7:47 PM Raphael Taylor-Davies
> > >>>>>>>>>>>>>>>> <[email protected]> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Unless I am missing something, I think the selection
> > >>>>> use-case
> > >>>>>>>>>> could
> > >>>>>>>>>>> be
> > >>>>>>>>>>>>>>>>> equally well served by a dictionary-encoded
> > >>>>>>> BinarArray/ListArray,
> > >>>>>>>>>>> and
> > >>>>>>>>>>>>>>>> would
> > >>>>>>>>>>>>>>>>> have the benefit of not requiring any modifications
> > >> to the
> > >>>>>>>>>> existing
> > >>>>>>>>>>>>>>>> format
> > >>>>>>>>>>>>>>>>> or kernels.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> The major additional flexibility of the proposed
> > >> encoding
> > >>>>>> would
> > >>>>>>>>>> be
> > >>>>>>>>>>>>>>>>> permitting disjoint or overlapping ranges, are these
> > >>>> common
> > >>>>>>>>>> enough
> > >>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>> practice to represent a meaningful bottleneck?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On 26 April 2023 01:40:14 BST, David Li <
> > >>>>> [email protected]
> > >>>>>>>>>>> <mailto:
> > >>>>>>>>>>>>> [email protected]>> wrote:
> > >>>>>>>>>>>>>>>>>> Is there a need for a 64-bit offsets version the
> > >> same way
> > >>>>> we
> > >>>>>>>>>> have
> > >>>>>>>>>>>>> List
> > >>>>>>>>>>>>>>>>> and LargeList?
> > >>>>>>>>>>>>>>>>>> And just to be clear, the difference with List is
> > >> that
> > >>>> the
> > >>>>>>> lists
> > >>>>>>>>>>>>> don't
> > >>>>>>>>>>>>>>>>> have to be stored in their logical order (or in other
> > >>>> words,
> > >>>>>>>>>> offsets
> > >>>>>>>>>>>>> do
> > >>>>>>>>>>>>>>>> not
> > >>>>>>>>>>>>>>>>> have to be nondecreasing and so we also need sizes)?
> > >>>>>>>>>>>>>>>>>> On Wed, Apr 26, 2023, at 09:37, Weston Pace wrote:
> > >>>>>>>>>>>>>>>>>>> For context, there was some discussion on this back
> > >> in
> > >>>>> [1].
> > >>>>>> At
> > >>>>>>>>>>> that
> > >>>>>>>>>>>>>>>>> time
> > >>>>>>>>>>>>>>>>>>> this was called "sequence view" but I do not like
> > >> that
> > >>>>> name.
> > >>>>>>>>>>>>>>> However,
> > >>>>>>>>>>>>>>>>>>> array-view array is a little confusing. Given this
> > >> is
> > >>>>>> similar
> > >>>>>>>>>> to
> > >>>>>>>>>>>>>>> list
> > >>>>>>>>>>>>>>>>> can
> > >>>>>>>>>>>>>>>>>>> we go with list-view array?
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Thanks for the introduction. I'd be interested to
> > >> hear
> > >>>>>> about
> > >>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>> applications Velox has found for these vectors,
> > >> and in
> > >>>>> what
> > >>>>>>>>>>>>>>>> situations
> > >>>>>>>>>>>>>>>>>>> they
> > >>>>>>>>>>>>>>>>>>>> are useful. This could be contrasted with the
> > >> current
> > >>>>>>>>>> ListArray
> > >>>>>>>>>>>>>>>>>>>> implementations.
> > >>>>>>>>>>>>>>>>>>> I believe one significant benefit is that take (and
> > >> by
> > >>>>>> proxy,
> > >>>>>>>>>>>>>>> filter)
> > >>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>> sort are O(# of items) with the proposed format and
> > >> O(#
> > >>>> of
> > >>>>>>>>>> bytes)
> > >>>>>>>>>>>>>>> with
> > >>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>> current format. Jorge did some profiling to this
> > >> effect
> > >>>> in
> > >>>>>>> [1].
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>>>
> > >>>>>> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq<
> > >>>>>>>>>>>>>
> > >>>>> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq>
> > >>>>>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 3:13 PM Will Jones <
> > >>>>>>>>>>> [email protected]
> > >>>>>>>>>>>>> <mailto:[email protected]>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>> Hi Felipe,
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Thanks for the introduction. I'd be interested to
> > >> hear
> > >>>>>> about
> > >>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>> applications Velox has found for these vectors,
> > >> and in
> > >>>>> what
> > >>>>>>>>>>>>>>>> situations
> > >>>>>>>>>>>>>>>>> they
> > >>>>>>>>>>>>>>>>>>>> are useful. This could be contrasted with the
> > >> current
> > >>>>>>>>>> ListArray
> > >>>>>>>>>>>>>>>>>>>> implementations.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> IIUC it would be fairly cheap to transform a
> > >> ListArray
> > >>>> to
> > >>>>>> an
> > >>>>>>>>>>>>>>>>> ArrayView, but
> > >>>>>>>>>>>>>>>>>>>> expensive to go the other way.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Will Jones
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira
> > >>>> Carvalho
> > >>>>> <
> > >>>>>>>>>>>>>>>>>>>> [email protected]<mailto:[email protected]>>
> > >>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Hi folks,
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> I would like to start a public discussion on the
> > >>>>> inclusion
> > >>>>>>>>>> of a
> > >>>>>>>>>>>>>>> new
> > >>>>>>>>>>>>>>>>> array
> > >>>>>>>>>>>>>>>>>>>>> format to Arrow — array-view array. The name is
> > >> also
> > >>>> up
> > >>>>>> for
> > >>>>>>>>>>>>>>> debate.
> > >>>>>>>>>>>>>>>>>>>>> This format is inspired by Velox's ArrayVector
> > >> format
> > >>>>> [1].
> > >>>>>>>>>>>>>>>> Logically,
> > >>>>>>>>>>>>>>>>>>>> this
> > >>>>>>>>>>>>>>>>>>>>> array represents an array of arrays. Each element
> > >> is
> > >>>> an
> > >>>>>>>>>>>>>>> array-view
> > >>>>>>>>>>>>>>>>>>>> (offset
> > >>>>>>>>>>>>>>>>>>>>> and size pair) that points to a range within a
> > >> nested
> > >>>>>>>>>> "values"
> > >>>>>>>>>>>>>>>> array
> > >>>>>>>>>>>>>>>>>>>>> (called "elements" in Velox docs). The nested
> > >> array
> > >>>> can
> > >>>>> be
> > >>>>>>> of
> > >>>>>>>>>>> any
> > >>>>>>>>>>>>>>>>> type,
> > >>>>>>>>>>>>>>>>>>>>> which makes this format very flexible and
> > >> powerful.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> [image: ../_images/array-vector.png]
> > >>>>>>>>>>>>>>>>>>>>> <
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>
> > >>>> https://facebookincubator.github.io/velox/_images/array-vector.png
> > >>>>> <
> > >>>>>>>>>>>>>
> > >>>>>>
> https://facebookincubator.github.io/velox/_images/array-vector.png
> > >>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> I'm currently working on a C++ implementation and
> > >> plan
> > >>>>> to
> > >>>>>>>>>> work
> > >>>>>>>>>>>>>>> on a
> > >>>>>>>>>>>>>>>>> Go
> > >>>>>>>>>>>>>>>>>>>>> implementation to fulfill the two-implementations
> > >>>>>>> requirement
> > >>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>> format
> > >>>>>>>>>>>>>>>>>>>>> changes.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> The draft design:
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> - 3 buffers: [validity_bitmap, int32 offsets
> > >> buffer,
> > >>>>> int32
> > >>>>>>>>>> sizes
> > >>>>>>>>>>>>>>>>> buffer]
> > >>>>>>>>>>>>>>>>>>>>> - 1 child array: "values" as an array of the type
> > >>>>>> parameter
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> validity_bitmap is used to differentiate between
> > >> empty
> > >>>>>> array
> > >>>>>>>>>>>>>>> views
> > >>>>>>>>>>>>>>>>>>>>> (sizes[i] == 0) and NULL array views
> > >>>> (validity_bitmap[i]
> > >>>>>> ==
> > >>>>>>>>>> 0).
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> When the validity_bitmap[i] is 0, both sizes and
> > >>>> offsets
> > >>>>>> are
> > >>>>>>>>>>>>>>>>> undefined
> > >>>>>>>>>>>>>>>>>>>> (as
> > >>>>>>>>>>>>>>>>>>>>> usual), and when sizes[i] == 0, offsets[i] is
> > >>>>> undefined. 0
> > >>>>>>> is
> > >>>>>>>>>>>>>>>>> recommended
> > >>>>>>>>>>>>>>>>>>>>> if setting a value is not an issue to the system
> > >>>>> producing
> > >>>>>>>>>> the
> > >>>>>>>>>>>>>>>>> arrays.
> > >>>>>>>>>>>>>>>>>>>>> offsets buffer is not required to be ordered and
> > >> views
> > >>>>>> don't
> > >>>>>>>>>>> have
> > >>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>> be
> > >>>>>>>>>>>>>>>>>>>>> disjoint.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>
> >
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> > >>>>>>>>>>>>> <
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>
> >
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>>>> Felipe O. Carvalho
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>
> > >
> >
> >
>

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Reply via email to