Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Andrew Lamb Wed, 14 Jun 2023 11:07:27 -0700

Arrow has at least 7 native "official" implementations (Java, Rust, Golang,
C#, Javascript, Julia and C++), 5 bindings on C++ (C, Ruby, Python, R, and
Matlab) and likely other implementations (like arrow2 in rust)


I think it is worth remembering that depending on what level of support
ListView aspires to, such an addition could require non trivial changes to
many / all of those implementations (and the APIs they expose).

Andrew

On Wed, Jun 14, 2023 at 12:53 PM Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:

> General approach to alternative formats aside, in the specific case of
> ListView, I think the implementation complexity is being overestimated in
> these discussions.
>
> The C++ Arrow implementation shares a lot of code between List and
> LargeList. And with some tweaks, I'm able to share that common
> infrastructure for ListView as well. [1]
>
> ListView is similar to list: it doesn't require offsets to be sorted and
> adds an extra buffer containing sizes. For symmetry with the List and
> LargeList types (FixedSizeList not included), I'm going to propose we add a
> LargeListView. That is not part of the draft implementation yet, but seems
> like an obvious thing to have now that I implemented the `if_else`
> specialization. [2] David Li asked about this above and I can confirm now
> that 64-bit version of ListView (LargeListView) is in the plans.
>
> Trying to avoid re-implementing some kernels is not a good goal to chase,
> IMO, because kernels need tweaks to take advantage of the format.
>
> [1] https://github.com/apache/arrow/pull/35345
> [2] https://github.com/felipecrv/arrow/commits/list_view_backup
> --
> Felipe
>
> On Wed, Jun 14, 2023 at 12:08 PM Weston Pace <weston.p...@gmail.com>
> wrote:
>
> > > perhaps we could support this use-case as
> > > a canonical extension type over dictionary encoded, variable-sized
> > > arrays
> >
> > I believe this suggestion is valid and could be used to solve the if-else
> > case.  The algorithm, if I understand it, would be roughly:
> >
> > ```
> > // Note: Simple pseudocode, vectorization left as exercise for the reader
> > auto indices_builder = ...
> > auto list_builder = ...
> > indices_builder.resize(batch.length);
> > Array condition_mask = condition.EvaluateBatch(batch);
> > for row_index in selected_rows(condition_mask):
> >   indices_builder[row_index] = list_builder.CurrentLength();
> >   list_builder.Append(if_expr.EvaluateRow(batch, row_index))
> > for row_index in unselected_rows(condition_mask):
> >   indices_builder[row_index] = list_builder.CurrentLength();
> >   list_builder.Append(else_expr.EvaluateRow(batch, row_index))
> > return DictionaryArray(indices_builder.Finish(), list_builder.Finish())
> > ```
> >
> > I also agree this is a slightly awkward use of dictionaries (e.g. the
> > dictionary would have the same size as the # of indices) and perhaps not
> > the most intuitive way to solve the problem.
> >
> > My gut reaction is simply "an improved if/else kernel is not, alone,
> enough
> > justification for a new layout" and yet...
> >
> > I think we are seeing the start (I hope) of a trend where Arrow is not
> just
> > used "between systems" (e.g. to shuttle data from one place to another,
> or
> > between a query engine and a visualization tool) but also "within
> systems"
> > (e.g. UDFs, bespoke file formats and temporary tables, between workers
> in a
> > distributed query engine).  When arrow is used "within systems" I think
> > both the number of bespoke formats and the significance of conversion
> cost
> > increases.  For example, it's easy to say that Velox should convert at
> the
> > boundary as data leaves Velox.  But what if Velox (or datafusion or ...)
> > were to define an interface for UDFs.  Would we want to use Arrow there
> > (e.g. the C data interface is a good fit)?  If so, wouldn't the
> conversion
> > cost be more significant?
> >
> > > Also, I'm very lukewarm towards the concept of "alternative layouts"
> > > suggested somewhere else in this thread. It does not seem a good choice
> > > to complexify the Arrow format that much.
> >
> > I think, in my opinion, this depends on how many of these alternative
> > layouts exist.  If there are just a few, then I agree, we should just
> adopt
> > them as formal first-class layouts.  If there are many, then I think it
> > will be too much complexity in Arrow to have all the different choices.
> > Or, we could say there are many, but the alternatives don't belong in
> Arrow
> > at all.  In that case I think it's the same question as the above
> > paragraph, "do we want Arrow to be used within systems?  Or just between
> > systems?"
> >
> >
> > On Wed, Jun 14, 2023 at 2:07 AM Antoine Pitrou <anto...@python.org>
> wrote:
> >
> > >
> > > I agree that ListView cannot be an extension type, given that it
> > > features a new layout, and therefore cannot reasonably be backed by an
> > > existing storage type (AFAICT).
> > >
> > > Also, I'm very lukewarm towards the concept of "alternative layouts"
> > > suggested somewhere else in this thread. It does not seem a good choice
> > > to complexify the Arrow format that much.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 07/06/2023 à 00:21, Felipe Oliveira Carvalho a écrit :
> > > > +1 on what Ian said.
> > > >
> > > > And as I write kernels for this new format, I’m learning that it’s
> > > possible
> > > > to re-use the common infrastructure used by List and LargeList to
> > > implement
> > > > the ListView related features with some adjustments.
> > > >
> > > > IMO having this format as a second-class citizen would more likely
> > > > complicate things because it would make this unification harder.
> > > >
> > > > —
> > > > Felipe
> > > >
> > > > On Tue, 6 Jun 2023 at 18:45 Ian Cook <ianmc...@apache.org> wrote:
> > > >
> > > >> To clarify why we cannot simply propose adding ListView as a new
> > > >> “canonical extension type”: The extension type mechanism in Arrow
> > > >> depends on the underlying data being organized in an existing Arrow
> > > >> layout—that way an implementation that does not support the
> extension
> > > >> type can still handle the underlying data. But ListView is a wholly
> > > >> new layout.
> > > >>
> > > >> I strongly agree with Weston’s idea that it is a good time for Arrow
> > > >> to introduce the notion of “canonical alternative layouts.”
> > > >>
> > > >> Taken together, I think that canonical extension types and canonical
> > > >> alternative layouts could serve as an “incubator” for proposed new
> > > >> representations. For example, if a proposed canonical alternative
> > > >> layout ends up being broadly adopted, then that will serve as a
> signal
> > > >> that we should consider adding it as a primary layout in the core
> > > >> spec.
> > > >>
> > > >> It seems to me that most projects that are implementing Arrow today
> > > >> are not aiming to provide complete coverage of Arrow; rather they
> are
> > > >> adopting Arrow because of its role as a standard and they are
> > > >> implementing only as much of the Arrow standard as they require to
> > > >> achieve some goal. I believe that such projects are important Arrow
> > > >> stakeholders, and I believe that this proposed notion of canonical
> > > >> alternative layouts will serve them well and will create
> efficiencies
> > > >> by standardizing implementations around a shared set of
> alternatives.
> > > >>
> > > >> However I think that the documentation for canonical alternative
> > > >> layouts should strongly encourage implementers to default to using
> the
> > > >> primary layouts defined in the core spec and only use alternative
> > > >> layouts in cases where the primary layouts do not meet their needs.
> > > >>
> > > >>
> > > >> On Sat, May 27, 2023 at 7:44 PM Micah Kornfield <
> > emkornfi...@gmail.com>
> > > >> wrote:
> > > >>>
> > > >>> This sounds reasonable to me but my main concern is, I'm not sure
> > there
> > > >> is
> > > >>> a great mechanism to enforce canonical layouts don't somehow become
> > > >> default
> > > >>> (or the only implementation).
> > > >>>
> > > >>> Even for these new layouts, I think it might be worth rethinking
> > > binding
> > > >> a
> > > >>> layout into the schema versus having a different concept of
> encoding
> > > (and
> > > >>> changing some of the corresponding data structures).
> > > >>>
> > > >>>
> > > >>> On Mon, May 22, 2023 at 10:37 AM Weston Pace <
> weston.p...@gmail.com>
> > > >> wrote:
> > > >>>
> > > >>>> Trying to settle on one option is a fruitless endeavor.  Each type
> > has
> > > >> pros
> > > >>>> and cons.  I would also predict that the largest existing usage of
> > > >> Arrow is
> > > >>>> shuttling data from one system to another.  The newly proposed
> > format
> > > >>>> doesn't appear to have any significant advantage for that use case
> > (if
> > > >>>> anything, the existing format is arguably better as it is more
> > > >> compact).
> > > >>>>
> > > >>>> I am very biased towards historical precedent and avoiding
> breaking
> > > >>>> changes.
> > > >>>>
> > > >>>> We have "canonical extension types", perhaps it is time for
> > "canonical
> > > >>>> alternative layouts".  We could define it as such:
> > > >>>>
> > > >>>>   * There are one or more primary layouts
> > > >>>>     * Existing layouts are automatically considered primary
> layouts,
> > > >> even if
> > > >>>> they wouldn't
> > > >>>>       have been primary layouts initially (e.g. large list)
> > > >>>>   * A new layout, if it is semantically equivalent to another, is
> > > >> considered
> > > >>>> an alternative layout
> > > >>>>   * An alternative layout still has the same requirements for
> > adoption
> > > >> (two
> > > >>>> implementations and a vote)
> > > >>>>     * An implementation should not feel pressured to rush and
> > > implement
> > > >> the
> > > >>>> new layout.
> > > >>>>       It would be good if they contribute in the discussion and
> > > >> consider the
> > > >>>> layout and vote
> > > >>>>       if they feel it would be an acceptable design.
> > > >>>>   * We can define and vote and approve as many canonical
> alternative
> > > >> layouts
> > > >>>> as we want:
> > > >>>>     * A canonical alternative layout should, at a minimum, have
> some
> > > >>>>       reasonable justification, such as improved performance for
> > > >> algorithm X
> > > >>>>   * Arrow implementations MUST support the primary layouts
> > > >>>>   * An Arrow implementation MAY support a canonical alternative,
> > > >> however:
> > > >>>>     * An Arrow implementation MUST first support the primary
> layout
> > > >>>>     * An Arrow implementation MUST support conversion to/from the
> > > >> primary
> > > >>>> and canonical layout
> > > >>>>     * An Arrow implementation's APIs MUST only provide data in the
> > > >>>> alternative
> > > >>>>       layout if it is explicitly asked for (e.g. schema inference
> > > should
> > > >>>> prefer the primary layout).
> > > >>>>   * We can still vote for new primary layouts (e.g. promoting a
> > > >> canonical
> > > >>>> alternative) but, in these
> > > >>>>      votes we don't only consider the value (e.g. performance) of
> > the
> > > >> layout
> > > >>>> but also the interoperability.
> > > >>>>      In other words, a layout can only become a primary layout if
> > > there
> > > >> is
> > > >>>> significant evidence that most
> > > >>>>      implementations plan to adopt it.
> > > >>>>
> > > >>>> This lets us evolve support for new layouts more naturally.  We
> can
> > > >>>> generally assume that users will not, initially, be aware of these
> > > >>>> alternative layouts.  However, everything will just work.  They
> may
> > > >> start
> > > >>>> to see a performance penalty stemming from a lack of support for
> > these
> > > >>>> layouts.  If this performance penalty becomes significant then
> they
> > > >> will
> > > >>>> discover it and become aware of the problem.  They can then ask
> > > >> whatever
> > > >>>> library they are using to add support for the alternative layout.
> > As
> > > >>>> enough users find a need for it then libraries will add support.
> > > >>>> Eventually, enough libraries will support it that we can adopt it
> > as a
> > > >>>> primary layout.
> > > >>>>
> > > >>>> Also, it allows libraries to adopt alternative layouts more
> > > >> aggressively if
> > > >>>> they would like while still hopefully ensuring that we eventually
> > all
> > > >>>> converge on the same implementation of the alternative layout.
> > > >>>>
> > > >>>> On Mon, May 22, 2023 at 9:35 AM Will Jones <
> will.jones...@gmail.com
> > >
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Hello Arrow devs,
> > > >>>>>
> > > >>>>> I don't understand why we would start deprecating features in the
> > > >> Arrow
> > > >>>>>> format. Even starting this talk might already be a bad idea
> > > >> PR-wise.
> > > >>>>>>
> > > >>>>>
> > > >>>>> I agree we don't want to make breaking changes to the Arrow
> format.
> > > >> But
> > > >>>>> several maintainers have already stated they have no interest in
> > > >>>>> maintaining both list types with full compute functionality
> [1][2],
> > > >> so I
> > > >>>>> think it's very likely one list type or the other will be
> > > >>>>> implicitly preferred in the ecosystem if this data type was
> added.
> > > >> If
> > > >>>>> that's the case, I'd prefer that we agreed as a community which
> one
> > > >>>> should
> > > >>>>> be preferred. Maybe that's not the best path; it's just one way
> for
> > > >> us to
> > > >>>>> balance stability, maintenance burden, and relevance.
> > > >>>>>
> > > >>>>> Can someone help distill down the primary rationale and usecase
> for
> > > >>>>>> adding ArrayView to the Arrow Spec?
> > > >>>>>>
> > > >>>>>
> > > >>>>> Looking back at that old thread, I think one of the main
> > motivations
> > > >> is
> > > >>>> to
> > > >>>>> try to prevent query engine implementers from feeling there is a
> > > >> tradeoff
> > > >>>>> between having state-of-the-art performance and being
> Arrow-native.
> > > >> For
> > > >>>>> some of the new array types, we had both Velox and DuckDB use
> them,
> > > >> so it
> > > >>>>> was reasonable to expect they were innovations that might
> > > >> proliferate.
> > > >>>> I'm
> > > >>>>> not sure if the ArrayView is part of that. From Wes earlier [3]:
> > > >>>>>
> > > >>>>> The idea is that in a world of data and query federation (for
> > > >> example,
> > > >>>>>> consider [1] where Arrow is being used as a data federation
> layer
> > > >> with
> > > >>>>> many
> > > >>>>>> query engines), we want to increase the amount of data in-flight
> > > >> and
> > > >>>>>> in-memory that is in Arrow format. So if query engines are
> having
> > > >> to
> > > >>>>> depart
> > > >>>>>> substantially from the Arrow format to get performance, then
> this
> > > >>>>> creates a
> > > >>>>>> potential lose-lose situation: * Depart from Arrow: get better
> > > >>>>> performance
> > > >>>>>> but pay serialization costs to read and write Arrow (the
> > > >> performance
> > > >>>> and
> > > >>>>>> resource utilization benefits outweigh the serialization costs).
> > > >> This
> > > >>>>> puts
> > > >>>>>> additional pressure on query engines to build specialized
> > > >> components
> > > >>>> for
> > > >>>>>> solving problems rather than making use of off-the-shelf
> > components
> > > >>>> that
> > > >>>>>> use Arrow. This has knock-on effects on ecosystem
> fragmentation. *
> > > >> Or
> > > >>>> use
> > > >>>>>> Arrow, and accept suboptimal query processing performance
> > > >>>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> Will mentions one usecase is Velox consuming python UDF output,
> > which
> > > >>>> seems
> > > >>>>>> to be mostly about how fast Velox can consume this format, not
> how
> > > >> fast
> > > >>>>> it
> > > >>>>>> can be written. Are there other usecases?
> > > >>>>>>
> > > >>>>>
> > > >>>>> To be clear, I don't know if that's the use case they want.
> That's
> > > >> just
> > > >>>> me
> > > >>>>> speculating.
> > > >>>>>
> > > >>>>> I still have some questions myself:
> > > >>>>>
> > > >>>>> 1. Is this array type currently only used in Velox? (not DuckDB
> > like
> > > >> some
> > > >>>>> of the other new types?) What evidence do we have that it will
> > become
> > > >>>> used
> > > >>>>> outside of Velox?
> > > >>>>> 2. We already have three list types: list, large list (64-bit
> > > >> offsets),
> > > >>>> and
> > > >>>>> fixed size list. Do we think we will only want a view version of
> > the
> > > >>>> 32-bit
> > > >>>>> offset variable length list? Or are we potentially talking about
> > view
> > > >>>>> variants for all three?
> > > >>>>>
> > > >>>>> Best,
> > > >>>>>
> > > >>>>> Will Jones
> > > >>>>>
> > > >>>>> [1]
> > https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx
> > > >>>>> [2]
> > https://lists.apache.org/thread/cc4w3vs3foj1fmpq9x888k51so60ftr3
> > > >>>>> [3]
> > https://lists.apache.org/thread/mk2yn62y6l8qtngcs1vg2qtwlxzbrt8t
> > > >>>>>
> > > >>>>> On Mon, May 22, 2023 at 3:48 AM Andrew Lamb <
> al...@influxdata.com>
> > > >>>> wrote:
> > > >>>>>
> > > >>>>>> Can someone help distill down the primary rationale and usecase
> > for
> > > >>>>>> adding ArrayView to the Arrow Spec?
> > > >>>>>>
> > > >>>>>>  From the above discussions, the stated rationale seems to be
> fast
> > > >>>>>> (zero-copy) interchange with Velox.
> > > >>>>>>
> > > >>>>>> This thread has qualitatively enumerated the benefits of
> > > >> (offset+len)
> > > >>>>>> encoding over the existing Arrow ListArray (offets) approach,
> but
> > I
> > > >>>>> haven't
> > > >>>>>> seen any performance measurements that might help us to gauge
> the
> > > >>>>> tradeoff
> > > >>>>>> in additional complexity vs runtime overhead.
> > > >>>>>>
> > > >>>>>> Will mentions one usecase is Velox consuming python UDF output,
> > > >> which
> > > >>>>> seems
> > > >>>>>> to be mostly about how fast Velox can consume this format, not
> how
> > > >> fast
> > > >>>>> it
> > > >>>>>> can be written. Are there other usecases?
> > > >>>>>>
> > > >>>>>> Do we have numbers showing how much overhead converting to /from
> > > >>>> Velox's
> > > >>>>>> internal representation and the existing ListArray adds? Has
> > > >> anyone in
> > > >>>>>> Velox land considered adding faster support for Arrow style
> > > >> ListArray
> > > >>>>>> encoding?
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> Andrew
> > > >>>>>>
> > > >>>>>> On Mon, May 22, 2023 at 4:38 AM Antoine Pitrou <
> > anto...@python.org
> > > >>>
> > > >>>>> wrote:
> > > >>>>>>
> > > >>>>>>>
> > > >>>>>>> Hi,
> > > >>>>>>>
> > > >>>>>>> I don't understand why we would start deprecating features in
> the
> > > >>>> Arrow
> > > >>>>>>> format. Even starting this talk might already be a bad idea
> > > >> PR-wise.
> > > >>>>>>>
> > > >>>>>>> As for implementing conversions at the I/O boundary, it's a
> > > >>>> reasonably
> > > >>>>>>> policy, but it still requires work by implementors and it's not
> > > >>>> granted
> > > >>>>>>> that all consumers of the Arrow format will grow such
> conversions
> > > >>>>>>> if/when we add non-trivial types such as ListView or
> StringView.
> > > >>>>>>>
> > > >>>>>>> Regards
> > > >>>>>>>
> > > >>>>>>> Antoine.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Le 22/05/2023 à 00:39, Will Jones a écrit :
> > > >>>>>>>> One more thing: Looking back on the previous discussion[1]
> > > >> (which
> > > >>>>>> Weston
> > > >>>>>>>> pointed out in his earlier message), Jorge suggested that the
> > > >> old
> > > >>>>> list
> > > >>>>>>>> types might be deprecated in favor of view variants [2].
> Others
> > > >>>> were
> > > >>>>>>>> worried that it might undermine the perception that the Arrow
> > > >>>> format
> > > >>>>> is
> > > >>>>>>>> stable. I think it might be worth thinking about "soft
> > > >> deprecating"
> > > >>>>> the
> > > >>>>>>> old
> > > >>>>>>>> list type: suggesting new implementations prefer the list
> > > >> view, but
> > > >>>>>>>> reassuring that implementations should support the old format,
> > > >> even
> > > >>>>> if
> > > >>>>>>> they
> > > >>>>>>>> just convert to the new format. To be clear, this wouldn't
> > > >> mean we
> > > >>>>> plan
> > > >>>>>>> to
> > > >>>>>>>> create breaking changes in the format; but if we ever did for
> > > >> other
> > > >>>>>>>> reasons, the old list type might go.
> > > >>>>>>>>
> > > >>>>>>>> Arrow compute libraries could choose either format for compute
> > > >>>>> support,
> > > >>>>>>> and
> > > >>>>>>>> plan to do conversion at the boundaries. Libraries that use
> > > >> the new
> > > >>>>>> type
> > > >>>>>>>> will have cheap conversion when reading the old type.
> Meanwhile
> > > >>>> those
> > > >>>>>>> that
> > > >>>>>>>> are still on the old type will have some incentive to move
> > > >> towards
> > > >>>>> the
> > > >>>>>>> new
> > > >>>>>>>> one, since that conversion will not be as efficient.
> > > >>>>>>>>
> > > >>>>>>>> [1]
> > > >>>> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > > >>>>>>>> [2]
> > > >>>> https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx
> > > >>>>>>>>
> > > >>>>>>>> On Sun, May 21, 2023 at 3:07 PM Will Jones <
> > > >>>> will.jones...@gmail.com>
> > > >>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Hello,
> > > >>>>>>>>>
> > > >>>>>>>>> I think Sasha brings up a good point, that the advantages of
> > > >> this
> > > >>>>>> format
> > > >>>>>>>>> seem to be primarily about query processing. Other encodings
> > > >> like
> > > >>>>> REE
> > > >>>>>>> and
> > > >>>>>>>>> dictionary have space-saving advantages that justify them
> > > >> simply
> > > >>>> in
> > > >>>>>>> terms
> > > >>>>>>>>> of space efficiency (although they have query processing
> > > >>>> advantages
> > > >>>>> as
> > > >>>>>>>>> well). As discussed, most use cases are already well served
> by
> > > >>>>>> existing
> > > >>>>>>>>> list types and dictionary encoding.
> > > >>>>>>>>>
> > > >>>>>>>>> I agree that there are cases where transferring this type
> > > >> without
> > > >>>>>>>>> conversion would be ideal. One use case I can think of is if
> > > >> Velox
> > > >>>>>>> wants to
> > > >>>>>>>>> be able to take Arrow-based UDFs (possibly written with
> > > >> PyArrow,
> > > >>>> for
> > > >>>>>>>>> example) that operate on this column type and therefore wants
> > > >>>>>> zero-copy
> > > >>>>>>>>> exchange over the C Data Interface.
> > > >>>>>>>>>
> > > >>>>>>>>> One big question I have: we already have three list types:
> > > >> list,
> > > >>>>> large
> > > >>>>>>>>> list (64-bit offsets), and fixed size list. Do we think we
> > > >> will
> > > >>>> only
> > > >>>>>>> want a
> > > >>>>>>>>> view version of the 32-bit offset variable length list? Or
> > > >> are we
> > > >>>>>>>>> potentially talking about view variants for all three?
> > > >>>>>>>>>
> > > >>>>>>>>> Best,
> > > >>>>>>>>>
> > > >>>>>>>>> Will Jones
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On Sun, May 21, 2023 at 2:19 PM Felipe Oliveira Carvalho <
> > > >>>>>>>>> felipe...@gmail.com> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> The benefit of having a memory format that’s friendly to
> > > >>>>>>> non-deterministic
> > > >>>>>>>>>> order writes is unlocked by the transport and processing of
> > > >> the
> > > >>>>> data
> > > >>>>>>> being
> > > >>>>>>>>>> agnostic to the physical order as much as possible.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Requiring a conversion could cancel out that benefit. But it
> > > >> can
> > > >>>>> be a
> > > >>>>>>>>>> provisory step for compatibility between systems that don’t
> > > >>>>>> understand
> > > >>>>>>> the
> > > >>>>>>>>>> format yet. This is similar to the situation with
> compression
> > > >>>>> schemes
> > > >>>>>>> like
> > > >>>>>>>>>> run-end encoding — the goal is processing the compressed
> data
> > > >>>>>> directly
> > > >>>>>>>>>> without an expansion step whenever possible.
> > > >>>>>>>>>>
> > > >>>>>>>>>> This is why having it as part of the open Arrow format is so
> > > >>>>>> important:
> > > >>>>>>>>>> everyone can agree on a format that’s friendly to parallel
> > > >> and/or
> > > >>>>>>>>>> vectorized compute kernels without introducing multiple
> > > >>>>> incompatible
> > > >>>>>>>>>> formats to the ecosystem and without imposing a conversion
> > > >> step
> > > >>>>>> between
> > > >>>>>>>>>> the
> > > >>>>>>>>>> different systems.
> > > >>>>>>>>>>
> > > >>>>>>>>>> —
> > > >>>>>>>>>> Felipe
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Sat, 20 May 2023 at 20:04 Aldrin
> > > >> <octalene....@pm.me.invalid>
> > > >>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> I don't feel like this representation is necessarily a
> > > >> detail of
> > > >>>>> the
> > > >>>>>>>>>> query
> > > >>>>>>>>>>> engine, but I am also not sure why this representation
> would
> > > >>>> have
> > > >>>>> to
> > > >>>>>>> be
> > > >>>>>>>>>>> converted to a non-view format when serializing. Could you
> > > >>>> clarify
> > > >>>>>>>>>> that? My
> > > >>>>>>>>>>> impression is that this representation could be used for
> > > >>>>> persistence
> > > >>>>>>> or
> > > >>>>>>>>>>> data transfer, though it can be more complex to guarantee
> > > >> the
> > > >>>>>> portion
> > > >>>>>>> of
> > > >>>>>>>>>>> the buffer that an index points to is also present in
> > > >> memory.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Sent from Proton Mail for iOS
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <
> > > >>>>>>>>>> krassovskysa...@gmail.com
> > > >>>>>>>>>>>
> > > >> <On+Sat,+May+20,+2023+at+15:00,+Sasha+Krassovsky+%3C%3Ca+href=>>
> > > >>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Hi everyone,
> > > >>>>>>>>>>> I understand that there are numerous benefits to this
> > > >>>>> representation
> > > >>>>>>>>>>> during query processing, but would it be fair to say that
> > > >> this
> > > >>>> is
> > > >>>>> an
> > > >>>>>>>>>>> implementation detail of the query engine? Query engines
> > > >> don’t
> > > >>>>>>>>>> necessarily
> > > >>>>>>>>>>> need to conform to the Arrow format internally, only at
> > > >>>>>> ingest/egress
> > > >>>>>>>>>>> points, and performing a conversion from the non-view to
> > > >> view
> > > >>>>> format
> > > >>>>>>>>>> seems
> > > >>>>>>>>>>> like it would be very cheap (though I understand not
> > > >> necessarily
> > > >>>>> the
> > > >>>>>>>>>> other
> > > >>>>>>>>>>> way around, but you’d need to do that anyway if you’re
> > > >>>>> serializing).
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Sasha Krassovsky
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> 20 мая 2023 г., в 13:00, Will Jones <
> > > >> will.jones...@gmail.com>
> > > >>>>>>>>>>> написал(а):
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thanks for sharing these details, Pedro. The conditional
> > > >>>>> branches
> > > >>>>>>>>>>> argument
> > > >>>>>>>>>>>> makes a lot of sense to me.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> The tensors point brings up some interesting issues. For
> > > >> now,
> > > >>>>> we've
> > > >>>>>>>>>>> defined
> > > >>>>>>>>>>>> our only tensor extension type to be built on a fixed size
> > > >>>> list.
> > > >>>>>> If a
> > > >>>>>>>>>> use
> > > >>>>>>>>>>>> case of this might be manipulating tensors with zero copy,
> > > >>>>> perhaps
> > > >>>>>>>>>> that
> > > >>>>>>>>>>>> suggests that we want a fixed size list variant? In
> > > >> addition,
> > > >>>>> would
> > > >>>>>>> we
> > > >>>>>>>>>>> have
> > > >>>>>>>>>>>> to define another extension type to be a ListView variant?
> > > >> Or
> > > >>>>> would
> > > >>>>>>> we
> > > >>>>>>>>>>> want
> > > >>>>>>>>>>>> to think about making extension types somehow valid across
> > > >>>>> various
> > > >>>>>>>>>>>> encodings of the same "logical type"?
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha
> > > >> Pedreira
> > > >>>>>>>>>>>>> <pedro...@meta.com.invalid> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Hi all,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> This is Pedro from the Velox team at Meta. This is my
> > > >> first
> > > >>>> time
> > > >>>>>>>>>> here,
> > > >>>>>>>>>>> so
> > > >>>>>>>>>>>>> nice to e-meet you!
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Adding to what Felipe said, the main reason we created
> > > >>>>> “ListView”
> > > >>>>>>>>>>> (though
> > > >>>>>>>>>>>>> we just call them ArrayVector/MapVector in Velox) is
> that,
> > > >>>> along
> > > >>>>>>> with
> > > >>>>>>>>>>>>> StringViews for strings, they allow us to write any
> > > >> columnar
> > > >>>>>> buffer
> > > >>>>>>>>>>>>> out-or-order, regardless of their types or encodings.
> > > >> This is
> > > >>>>>>>>>> naturally
> > > >>>>>>>>>>>>> doable for all primitive types (fixed-size), but not for
> > > >> types
> > > >>>>>> that
> > > >>>>>>>>>>> don’t
> > > >>>>>>>>>>>>> have fixed size and are required to be contiguous. The
> > > >>>>> StringView
> > > >>>>>>> and
> > > >>>>>>>>>>>>> ListView formats allow us to keep this invariant in
> Velox.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Being able to write vectors out-of-order is useful when
> > > >>>>> executing
> > > >>>>>>>>>>>>> conditionals like IF/SWITCH statements, which are
> > > >> pervasive
> > > >>>>> among
> > > >>>>>>> our
> > > >>>>>>>>>>>>> workloads. To fully vectorize it, one first evaluates the
> > > >>>>>>> expression,
> > > >>>>>>>>>>> then
> > > >>>>>>>>>>>>> generate a bitmap containing which rows take the THEN and
> > > >>>> which
> > > >>>>>> take
> > > >>>>>>>>>> the
> > > >>>>>>>>>>>>> ELSE branch. Then you populate all rows that match the
> > > >> first
> > > >>>>>> branch
> > > >>>>>>>>>> by
> > > >>>>>>>>>>>>> evaluating the THEN expression in a vectorized
> > > >> (branch-less
> > > >>>> and
> > > >>>>>>> cache
> > > >>>>>>>>>>>>> friendly) way, and subsequently the ELSE branch. If you
> > > >> can’t
> > > >>>>>> write
> > > >>>>>>>>>> them
> > > >>>>>>>>>>>>> out-of-order, you would either have a big branch per row
> > > >>>>>> dispatching
> > > >>>>>>>>>> to
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>> right expression (slow), or populate two distinct vectors
> > > >> then
> > > >>>>>>>>>> merging
> > > >>>>>>>>>>> them
> > > >>>>>>>>>>>>> at the end (potentially even slower). How much faster our
> > > >>>>> approach
> > > >>>>>>> is
> > > >>>>>>>>>>>>> highly depends on the buffer sizes and expressions, but
> we
> > > >>>> found
> > > >>>>>> it
> > > >>>>>>>>>> to
> > > >>>>>>>>>>> be
> > > >>>>>>>>>>>>> faster enough on average to justify us extending the
> > > >>>> underlying
> > > >>>>>>>>>> layout.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> With that said, this is all within a single thread of
> > > >>>> execution.
> > > >>>>>>>>>>>>> Parallelization is done by feeding each thread its own
> > > >>>>>> vector/data.
> > > >>>>>>>>>> As
> > > >>>>>>>>>>>>> pointed out in a previous message, this also gives you
> the
> > > >>>>>>>>>> flexibility
> > > >>>>>>>>>>> to
> > > >>>>>>>>>>>>> implement cardinality increasing/reducing operations, but
> > > >> we
> > > >>>>> don’t
> > > >>>>>>>>>> use
> > > >>>>>>>>>>> it
> > > >>>>>>>>>>>>> for that purpose. Operations like filtering, joining,
> > > >>>> unnesting
> > > >>>>>> and
> > > >>>>>>>>>>> similar
> > > >>>>>>>>>>>>> are done by wrapping the internal vector in a dictionary,
> > > >> as
> > > >>>>> these
> > > >>>>>>>>>> need
> > > >>>>>>>>>>> to
> > > >>>>>>>>>>>>> work not only on “ListViews” but on any data types with
> > > >> any
> > > >>>>>>> encoding.
> > > >>>>>>>>>>> There
> > > >>>>>>>>>>>>> are more details on Section 4.2.1 in [1]
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Beyond this, it also gives function/kernel developers
> more
> > > >>>>>>>>>> flexibility
> > > >>>>>>>>>>> to
> > > >>>>>>>>>>>>> implement operations that manipulate Arrays/Maps. For
> > > >> example,
> > > >>>>>>>>>>> operations
> > > >>>>>>>>>>>>> that slice these containers can be implemented in a
> > > >> zero-copy
> > > >>>>>> manner
> > > >>>>>>>>>> by
> > > >>>>>>>>>>>>> just rearranging the lengths/offsets indices, without
> ever
> > > >>>>>> touching
> > > >>>>>>>>>> the
> > > >>>>>>>>>>>>> larger internal buffers. This is a similar motivation as
> > > >> for
> > > >>>>>>>>>> StringView
> > > >>>>>>>>>>>>> (think of substr(), trim(), and similar). One nice last
> > > >>>> property
> > > >>>>>> is
> > > >>>>>>>>>> that
> > > >>>>>>>>>>>>> this layout allows for overlapping ranges. This is
> > > >> something
> > > >>>>>>>>>> discussed
> > > >>>>>>>>>>> with
> > > >>>>>>>>>>>>> our ML people to allow deduping feature values in a
> tensor
> > > >>>>> (which
> > > >>>>>> is
> > > >>>>>>>>>>> fairly
> > > >>>>>>>>>>>>> common), but not something we have leveraged just yet.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> [1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>> --
> > > >>>>>>>>>>>>> Pedro Pedreira
> > > >>>>>>>>>>>>> ________________________________
> > > >>>>>>>>>>>>> From: Felipe Oliveira Carvalho <felipe...@gmail.com>
> > > >>>>>>>>>>>>> Sent: Friday, May 19, 2023 10:01 AM
> > > >>>>>>>>>>>>> To: dev@arrow.apache.org <dev@arrow.apache.org>
> > > >>>>>>>>>>>>> Cc: Pedro Eugenio Rocha Pedreira <pedro...@meta.com>
> > > >>>>>>>>>>>>> Subject: Re: [DISCUSS][Format] Starting the draft
> > > >>>> implementation
> > > >>>>>> of
> > > >>>>>>>>>> the
> > > >>>>>>>>>>>>> ArrayView array format
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> +pedroerp On Thu, 11 May 2023 at 17: 51 Raphael
> > > >> Taylor-Davies
> > > >>>>> <r.
> > > >>>>>>>>>>>>> taylordavies@ googlemail. com. invalid> wrote: Hi All, >
> > > >> if
> > > >>>> we
> > > >>>>>>> added
> > > >>>>>>>>>>>>> this, do we think many Arrow and query > engine
> > > >>>> implementations
> > > >>>>>> (for
> > > >>>>>>>>>>>>> example, DataFusion) will be
> > > >>>>>>>>>>>>> ZjQcmQRYFpfptBannerStart
> > > >>>>>>>>>>>>> This Message Is From an External Sender
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> ZjQcmQRYFpfptBannerEnd
> > > >>>>>>>>>>>>> +pedroerp
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies
> > > >>>>>>>>>>>>> <r.taylordav...@googlemail.com.invalid> wrote:
> > > >>>>>>>>>>>>> Hi All,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> if we added this, do we think many Arrow and query
> > > >>>>>>>>>>>>>> engine implementations (for example, DataFusion) will be
> > > >>>> eager
> > > >>>>> to
> > > >>>>>>>>>> add
> > > >>>>>>>>>>>>> full
> > > >>>>>>>>>>>>>> support for the type, including compute kernels? Or are
> > > >> they
> > > >>>>>> likely
> > > >>>>>>>>>> to
> > > >>>>>>>>>>>>> just
> > > >>>>>>>>>>>>>> convert this type to ListArray at import boundaries?
> > > >>>>>>>>>>>>> I can't speak for query engines in general, but at least
> > > >> for
> > > >>>>>>> arrow-rs
> > > >>>>>>>>>>>>> and by extension DataFusion, and based on my current
> > > >>>>> understanding
> > > >>>>>>> of
> > > >>>>>>>>>>>>> the use-cases I would be rather hesitant to add support
> > > >> to the
> > > >>>>>>>>>> kernels
> > > >>>>>>>>>>>>> for this array type, definitely instead favouring
> > > >> conversion
> > > >>>> at
> > > >>>>>> the
> > > >>>>>>>>>>>>> edges. We already have issues with the amount of code
> > > >>>> generation
> > > >>>>>>>>>>>>> resulting in binary bloat and long compile times, and I
> > > >> worry
> > > >>>>> this
> > > >>>>>>>>>> would
> > > >>>>>>>>>>>>> worsen this situation whilst not really providing
> > > >> compelling
> > > >>>>>>>>>> advantages
> > > >>>>>>>>>>>>> for the vast majority of workloads that don't interact
> > > >> with
> > > >>>>> Velox.
> > > >>>>>>>>>>>>> Whilst I can definitely see that the ListView
> > > >> representation
> > > >>>> is
> > > >>>>>>>>>> probably
> > > >>>>>>>>>>>>> a better way to represent variable length lists than what
> > > >>>> arrow
> > > >>>>>>>>>> settled
> > > >>>>>>>>>>>>> upon, I'm not yet convinced it is sufficiently better to
> > > >>>>>> incentivise
> > > >>>>>>>>>>>>> broad ecosystem adoption.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Kind Regards,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Raphael Taylor-Davies
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On 11/05/2023 21:20, Will Jones wrote:
> > > >>>>>>>>>>>>>> Hi Felipe,
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Thanks for the additional details.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Velox kernels benefit from being able to append data to
> > > >> the
> > > >>>>>> array
> > > >>>>>>>>>> from
> > > >>>>>>>>>>>>>>> different threads without care for strict ordering.
> > > >> Only the
> > > >>>>>>>>>> offsets
> > > >>>>>>>>>>>>> array
> > > >>>>>>>>>>>>>>> has to be written according to logical order but that
> is
> > > >>>>>>>>>> potentially a
> > > >>>>>>>>>>>>> much
> > > >>>>>>>>>>>>>>> smaller buffer than the values buffer.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> It still seems to me like applications are still pretty
> > > >>>> niche,
> > > >>>>>> as I
> > > >>>>>>>>>>>>> suspect
> > > >>>>>>>>>>>>>> in most cases the benefits are outweighed by the costs.
> > > >> The
> > > >>>>>> benefit
> > > >>>>>>>>>>> here
> > > >>>>>>>>>>>>>> seems pretty limited: if you are trying to split work
> > > >> between
> > > >>>>>>>>>> threads,
> > > >>>>>>>>>>>>>> usually you will have other levels such as array chunks
> > > >> to
> > > >>>>>>>>>> parallelize.
> > > >>>>>>>>>>>>> And
> > > >>>>>>>>>>>>>> if you have an incoming stream of row data, you'll want
> > > >> to
> > > >>>>> append
> > > >>>>>>> in
> > > >>>>>>>>>>>>>> predictable order to match the order of the other
> > > >> arrays. Am
> > > >>>> I
> > > >>>>>>>>>> missing
> > > >>>>>>>>>>>>>> something?
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> And, IIUC, the cost of using ListView with out-of-order
> > > >>>> values
> > > >>>>>> over
> > > >>>>>>>>>>>>>> ListArray is you lose memory locality; the values of
> > > >> element
> > > >>>> 2
> > > >>>>>> are
> > > >>>>>>>>>> no
> > > >>>>>>>>>>>>>> longer adjacent to the values of element 1. What do you
> > > >> think
> > > >>>>>> about
> > > >>>>>>>>>>> that
> > > >>>>>>>>>>>>>> tradeoff?
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> I don't mean to be difficult about this. I'm excited for
> > > >> both
> > > >>>>> the
> > > >>>>>>>>>> REE
> > > >>>>>>>>>>> and
> > > >>>>>>>>>>>>>> StringView arrays, but this one I'm not so sure about
> > > >> yet. I
> > > >>>>>>> suppose
> > > >>>>>>>>>>>>> what I
> > > >>>>>>>>>>>>>> am trying to ask is, if we added this, do we think many
> > > >> Arrow
> > > >>>>> and
> > > >>>>>>>>>> query
> > > >>>>>>>>>>>>>> engine implementations (for example, DataFusion) will be
> > > >>>> eager
> > > >>>>> to
> > > >>>>>>>>>> add
> > > >>>>>>>>>>>>> full
> > > >>>>>>>>>>>>>> support for the type, including compute kernels? Or are
> > > >> they
> > > >>>>>> likely
> > > >>>>>>>>>> to
> > > >>>>>>>>>>>>> just
> > > >>>>>>>>>>>>>> convert this type to ListArray at import boundaries?
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Because if it turns out to be the latter, then we might
> > > >> as
> > > >>>> well
> > > >>>>>> ask
> > > >>>>>>>>>>> Velox
> > > >>>>>>>>>>>>>> to export this type as ListArray and save the rest of
> the
> > > >>>>>> ecosystem
> > > >>>>>>>>>>> some
> > > >>>>>>>>>>>>>> work.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Will Jones
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira
> > > >> Carvalho <
> > > >>>>>>>>>>>>>> felipe...@gmail.com<mailto:felipe...@gmail.com>> wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Initial reason for ListView arrays in Arrow is
> zero-copy
> > > >>>>>>>>>> compatibility
> > > >>>>>>>>>>>>> with
> > > >>>>>>>>>>>>>>> Velox which uses this format.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Velox kernels benefit from being able to append data to
> > > >> the
> > > >>>>>> array
> > > >>>>>>>>>> from
> > > >>>>>>>>>>>>>>> different threads without care for strict ordering.
> > > >> Only the
> > > >>>>>>>>>> offsets
> > > >>>>>>>>>>>>> array
> > > >>>>>>>>>>>>>>> has to be written according to logical order but that
> is
> > > >>>>>>>>>> potentially a
> > > >>>>>>>>>>>>> much
> > > >>>>>>>>>>>>>>> smaller buffer than the values buffer.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Acero kernels could take advantage of that in the
> > > >> future.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> In implementing ListViewArray/Type I was able to reuse
> > > >> some
> > > >>>>> C++
> > > >>>>>>>>>>>>> templates
> > > >>>>>>>>>>>>>>> used for ListArray which can reduce some of the burden
> > > >> on
> > > >>>>> kernel
> > > >>>>>>>>>>>>>>> implementations that aim to work with all the types.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> I’m can fix Acero kernels for working with ListView.
> > > >> This is
> > > >>>>>>>>>> similar
> > > >>>>>>>>>>> to
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>> work I’ve doing in kernels dealing with run-end encoded
> > > >>>>> arrays.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> —
> > > >>>>>>>>>>>>>>> Felipe
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Wed, 26 Apr 2023 at 01:03 Will Jones <
> > > >>>>>> will.jones...@gmail.com
> > > >>>>>>>>>>>>> <mailto:will.jones...@gmail.com>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> I suppose one common use case is materializing list
> > > >> columns
> > > >>>>>> after
> > > >>>>>>>>>>> some
> > > >>>>>>>>>>>>>>>> expanding operation like a join or unnest. That's a
> > > >> case
> > > >>>>> where
> > > >>>>>> I
> > > >>>>>>>>>>> could
> > > >>>>>>>>>>>>>>>> imagine a lot of repetition of values. Haven't yet
> > > >> thought
> > > >>>> of
> > > >>>>>>>>>> common
> > > >>>>>>>>>>>>>>> cases
> > > >>>>>>>>>>>>>>>> where there is overlap but not full duplication, but
> am
> > > >>>> eager
> > > >>>>>> to
> > > >>>>>>>>>> hear
> > > >>>>>>>>>>>>>>> any.
> > > >>>>>>>>>>>>>>>> The dictionary encoding point Raphael makes is
> > > >> interesting,
> > > >>>>>>>>>>> especially
> > > >>>>>>>>>>>>>>>> given the existence of LargeList and FixedSizeList.
> For
> > > >>>> many
> > > >>>>>>>>>>>>> operations,
> > > >>>>>>>>>>>>>>> it
> > > >>>>>>>>>>>>>>>> might make more sense to just compose those existing
> > > >> types.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> IIUC the operations that would be unique to the
> > > >> ArrayView
> > > >>>> are
> > > >>>>>>> ones
> > > >>>>>>>>>>>>>>> altering
> > > >>>>>>>>>>>>>>>> the shape. One could truncate each array to a certain
> > > >>>> length
> > > >>>>>>>>>> cheaply
> > > >>>>>>>>>>>>>>> simply
> > > >>>>>>>>>>>>>>>> by replacing the sizes buffer. Or perhaps there are
> > > >>>>> interesting
> > > >>>>>>>>>>>>>>> operations
> > > >>>>>>>>>>>>>>>> on tensors that would benefit.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 7:47 PM Raphael Taylor-Davies
> > > >>>>>>>>>>>>>>>> <r.taylordav...@googlemail.com.invalid> wrote:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Unless I am missing something, I think the selection
> > > >>>>> use-case
> > > >>>>>>>>>> could
> > > >>>>>>>>>>> be
> > > >>>>>>>>>>>>>>>>> equally well served by a dictionary-encoded
> > > >>>>>>> BinarArray/ListArray,
> > > >>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>> would
> > > >>>>>>>>>>>>>>>>> have the benefit of not requiring any modifications
> > > >> to the
> > > >>>>>>>>>> existing
> > > >>>>>>>>>>>>>>>> format
> > > >>>>>>>>>>>>>>>>> or kernels.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> The major additional flexibility of the proposed
> > > >> encoding
> > > >>>>>> would
> > > >>>>>>>>>> be
> > > >>>>>>>>>>>>>>>>> permitting disjoint or overlapping ranges, are these
> > > >>>> common
> > > >>>>>>>>>> enough
> > > >>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>>> practice to represent a meaningful bottleneck?
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> On 26 April 2023 01:40:14 BST, David Li <
> > > >>>>> lidav...@apache.org
> > > >>>>>>>>>>> <mailto:
> > > >>>>>>>>>>>>> lidav...@apache.org>> wrote:
> > > >>>>>>>>>>>>>>>>>> Is there a need for a 64-bit offsets version the
> > > >> same way
> > > >>>>> we
> > > >>>>>>>>>> have
> > > >>>>>>>>>>>>> List
> > > >>>>>>>>>>>>>>>>> and LargeList?
> > > >>>>>>>>>>>>>>>>>> And just to be clear, the difference with List is
> > > >> that
> > > >>>> the
> > > >>>>>>> lists
> > > >>>>>>>>>>>>> don't
> > > >>>>>>>>>>>>>>>>> have to be stored in their logical order (or in other
> > > >>>> words,
> > > >>>>>>>>>> offsets
> > > >>>>>>>>>>>>> do
> > > >>>>>>>>>>>>>>>> not
> > > >>>>>>>>>>>>>>>>> have to be nondecreasing and so we also need sizes)?
> > > >>>>>>>>>>>>>>>>>> On Wed, Apr 26, 2023, at 09:37, Weston Pace wrote:
> > > >>>>>>>>>>>>>>>>>>> For context, there was some discussion on this back
> > > >> in
> > > >>>>> [1].
> > > >>>>>> At
> > > >>>>>>>>>>> that
> > > >>>>>>>>>>>>>>>>> time
> > > >>>>>>>>>>>>>>>>>>> this was called "sequence view" but I do not like
> > > >> that
> > > >>>>> name.
> > > >>>>>>>>>>>>>>> However,
> > > >>>>>>>>>>>>>>>>>>> array-view array is a little confusing. Given this
> > > >> is
> > > >>>>>> similar
> > > >>>>>>>>>> to
> > > >>>>>>>>>>>>>>> list
> > > >>>>>>>>>>>>>>>>> can
> > > >>>>>>>>>>>>>>>>>>> we go with list-view array?
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Thanks for the introduction. I'd be interested to
> > > >> hear
> > > >>>>>> about
> > > >>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>> applications Velox has found for these vectors,
> > > >> and in
> > > >>>>> what
> > > >>>>>>>>>>>>>>>> situations
> > > >>>>>>>>>>>>>>>>>>> they
> > > >>>>>>>>>>>>>>>>>>>> are useful. This could be contrasted with the
> > > >> current
> > > >>>>>>>>>> ListArray
> > > >>>>>>>>>>>>>>>>>>>> implementations.
> > > >>>>>>>>>>>>>>>>>>> I believe one significant benefit is that take (and
> > > >> by
> > > >>>>>> proxy,
> > > >>>>>>>>>>>>>>> filter)
> > > >>>>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>>> sort are O(# of items) with the proposed format and
> > > >> O(#
> > > >>>> of
> > > >>>>>>>>>> bytes)
> > > >>>>>>>>>>>>>>> with
> > > >>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>> current format. Jorge did some profiling to this
> > > >> effect
> > > >>>> in
> > > >>>>>>> [1].
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> [1]
> > > >>>>>>>>>>>>>>>
> > > >>>>>>
> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq<
> > > >>>>>>>>>>>>>
> > > >>>>> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> >
> > > >>>>>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 3:13 PM Will Jones <
> > > >>>>>>>>>>> will.jones...@gmail.com
> > > >>>>>>>>>>>>> <mailto:will.jones...@gmail.com>
> > > >>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>> Hi Felipe,
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Thanks for the introduction. I'd be interested to
> > > >> hear
> > > >>>>>> about
> > > >>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>> applications Velox has found for these vectors,
> > > >> and in
> > > >>>>> what
> > > >>>>>>>>>>>>>>>> situations
> > > >>>>>>>>>>>>>>>>> they
> > > >>>>>>>>>>>>>>>>>>>> are useful. This could be contrasted with the
> > > >> current
> > > >>>>>>>>>> ListArray
> > > >>>>>>>>>>>>>>>>>>>> implementations.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> IIUC it would be fairly cheap to transform a
> > > >> ListArray
> > > >>>> to
> > > >>>>>> an
> > > >>>>>>>>>>>>>>>>> ArrayView, but
> > > >>>>>>>>>>>>>>>>>>>> expensive to go the other way.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Will Jones
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira
> > > >>>> Carvalho
> > > >>>>> <
> > > >>>>>>>>>>>>>>>>>>>> felipe...@gmail.com<mailto:felipe...@gmail.com>>
> > > >>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Hi folks,
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> I would like to start a public discussion on the
> > > >>>>> inclusion
> > > >>>>>>>>>> of a
> > > >>>>>>>>>>>>>>> new
> > > >>>>>>>>>>>>>>>>> array
> > > >>>>>>>>>>>>>>>>>>>>> format to Arrow — array-view array. The name is
> > > >> also
> > > >>>> up
> > > >>>>>> for
> > > >>>>>>>>>>>>>>> debate.
> > > >>>>>>>>>>>>>>>>>>>>> This format is inspired by Velox's ArrayVector
> > > >> format
> > > >>>>> [1].
> > > >>>>>>>>>>>>>>>> Logically,
> > > >>>>>>>>>>>>>>>>>>>> this
> > > >>>>>>>>>>>>>>>>>>>>> array represents an array of arrays. Each element
> > > >> is
> > > >>>> an
> > > >>>>>>>>>>>>>>> array-view
> > > >>>>>>>>>>>>>>>>>>>> (offset
> > > >>>>>>>>>>>>>>>>>>>>> and size pair) that points to a range within a
> > > >> nested
> > > >>>>>>>>>> "values"
> > > >>>>>>>>>>>>>>>> array
> > > >>>>>>>>>>>>>>>>>>>>> (called "elements" in Velox docs). The nested
> > > >> array
> > > >>>> can
> > > >>>>> be
> > > >>>>>>> of
> > > >>>>>>>>>>> any
> > > >>>>>>>>>>>>>>>>> type,
> > > >>>>>>>>>>>>>>>>>>>>> which makes this format very flexible and
> > > >> powerful.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> [image: ../_images/array-vector.png]
> > > >>>>>>>>>>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>
> https://facebookincubator.github.io/velox/_images/array-vector.png
> > > >>>>> <
> > > >>>>>>>>>>>>>
> > > >>>>>>
> > https://facebookincubator.github.io/velox/_images/array-vector.png
> > > >>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> I'm currently working on a C++ implementation and
> > > >> plan
> > > >>>>> to
> > > >>>>>>>>>> work
> > > >>>>>>>>>>>>>>> on a
> > > >>>>>>>>>>>>>>>>> Go
> > > >>>>>>>>>>>>>>>>>>>>> implementation to fulfill the two-implementations
> > > >>>>>>> requirement
> > > >>>>>>>>>>> for
> > > >>>>>>>>>>>>>>>>> format
> > > >>>>>>>>>>>>>>>>>>>>> changes.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> The draft design:
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> - 3 buffers: [validity_bitmap, int32 offsets
> > > >> buffer,
> > > >>>>> int32
> > > >>>>>>>>>> sizes
> > > >>>>>>>>>>>>>>>>> buffer]
> > > >>>>>>>>>>>>>>>>>>>>> - 1 child array: "values" as an array of the type
> > > >>>>>> parameter
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> validity_bitmap is used to differentiate between
> > > >> empty
> > > >>>>>> array
> > > >>>>>>>>>>>>>>> views
> > > >>>>>>>>>>>>>>>>>>>>> (sizes[i] == 0) and NULL array views
> > > >>>> (validity_bitmap[i]
> > > >>>>>> ==
> > > >>>>>>>>>> 0).
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> When the validity_bitmap[i] is 0, both sizes and
> > > >>>> offsets
> > > >>>>>> are
> > > >>>>>>>>>>>>>>>>> undefined
> > > >>>>>>>>>>>>>>>>>>>> (as
> > > >>>>>>>>>>>>>>>>>>>>> usual), and when sizes[i] == 0, offsets[i] is
> > > >>>>> undefined. 0
> > > >>>>>>> is
> > > >>>>>>>>>>>>>>>>> recommended
> > > >>>>>>>>>>>>>>>>>>>>> if setting a value is not an issue to the system
> > > >>>>> producing
> > > >>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>> arrays.
> > > >>>>>>>>>>>>>>>>>>>>> offsets buffer is not required to be ordered and
> > > >> views
> > > >>>>>> don't
> > > >>>>>>>>>>> have
> > > >>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>> be
> > > >>>>>>>>>>>>>>>>>>>>> disjoint.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> [1]
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>
> > >
> >
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> > > >>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>
> > >
> >
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>>>> Felipe O. Carvalho
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>
> > > >
> > >
> > >
> >
>

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Reply via email to