Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Andrew Lamb Thu, 15 Jun 2023 04:17:50 -0700

I want to be clear, insofar that ListView makes using the arrow libraries
more attractive to system developers, I am in favor of adding it.


Arrow the specification is focused on interoperability. Arrow the libraries
(specifically the compute kernels included in many implementations) also
offer fast computation

Even if ListView is rarely used for interoperability (if it never gains
wide adoption), some of the arrow implementations could use ListView to
offer faster computation kernels, which I think has real value

On Wed, Jun 14, 2023 at 8:32 PM Felipe Oliveira Carvalho <
[email protected]> wrote:

> On Wed, Jun 14, 2023 at 3:07 PM Andrew Lamb <[email protected]> wrote:
>
> > Arrow has at least 7 native "official" implementations (Java, Rust,
> Golang,
> > C#, Javascript, Julia and C++), 5 bindings on C++ (C, Ruby, Python, R,
> and
> > Matlab) and likely other implementations (like arrow2 in rust)
>
>
> Yes, the introduction of every new format has O(number_of_languages)
> complexity, but the complexity of the ListView format itself is low
> relative to other formats.
>
> We could also specify ListView to require a length-plus-1-long offsets
> buffer to make it just like a ListArray with an added sizes buffer. If
> offsets are sorted, a cast to ListArray would be trivial -- just share the
> offsets buffer. Explaining the format would be easier as well I think.
>
> I would love to discuss these details with the community.
>
>
> >
> I think it is worth remembering that depending on what level of support
> > ListView aspires to, such an addition could require non trivial changes
> to
> > many / all of those implementations (and the APIs they expose).
> >
> > Andrew
> >
> > On Wed, Jun 14, 2023 at 12:53 PM Felipe Oliveira Carvalho <
> > [email protected]> wrote:
> >
> > > General approach to alternative formats aside, in the specific case of
> > > ListView, I think the implementation complexity is being overestimated
> in
> > > these discussions.
> > >
> > > The C++ Arrow implementation shares a lot of code between List and
> > > LargeList. And with some tweaks, I'm able to share that common
> > > infrastructure for ListView as well. [1]
> > >
> > > ListView is similar to list: it doesn't require offsets to be sorted
> and
> > > adds an extra buffer containing sizes. For symmetry with the List and
> > > LargeList types (FixedSizeList not included), I'm going to propose we
> > add a
> > > LargeListView. That is not part of the draft implementation yet, but
> > seems
> > > like an obvious thing to have now that I implemented the `if_else`
> > > specialization. [2] David Li asked about this above and I can confirm
> now
> > > that 64-bit version of ListView (LargeListView) is in the plans.
> > >
> > > Trying to avoid re-implementing some kernels is not a good goal to
> chase,
> > > IMO, because kernels need tweaks to take advantage of the format.
> > >
> > > [1] https://github.com/apache/arrow/pull/35345
> > > [2] https://github.com/felipecrv/arrow/commits/list_view_backup
> > > --
> > > Felipe
> > >
> > > On Wed, Jun 14, 2023 at 12:08 PM Weston Pace <[email protected]>
> > > wrote:
> > >
> > > > > perhaps we could support this use-case as
> > > > > a canonical extension type over dictionary encoded, variable-sized
> > > > > arrays
> > > >
> > > > I believe this suggestion is valid and could be used to solve the
> > if-else
> > > > case.  The algorithm, if I understand it, would be roughly:
> > > >
> > > > ```
> > > > // Note: Simple pseudocode, vectorization left as exercise for the
> > reader
> > > > auto indices_builder = ...
> > > > auto list_builder = ...
> > > > indices_builder.resize(batch.length);
> > > > Array condition_mask = condition.EvaluateBatch(batch);
> > > > for row_index in selected_rows(condition_mask):
> > > >   indices_builder[row_index] = list_builder.CurrentLength();
> > > >   list_builder.Append(if_expr.EvaluateRow(batch, row_index))
> > > > for row_index in unselected_rows(condition_mask):
> > > >   indices_builder[row_index] = list_builder.CurrentLength();
> > > >   list_builder.Append(else_expr.EvaluateRow(batch, row_index))
> > > > return DictionaryArray(indices_builder.Finish(),
> list_builder.Finish())
> > > > ```
> > > >
> > > > I also agree this is a slightly awkward use of dictionaries (e.g. the
> > > > dictionary would have the same size as the # of indices) and perhaps
> > not
> > > > the most intuitive way to solve the problem.
> > > >
> > > > My gut reaction is simply "an improved if/else kernel is not, alone,
> > > enough
> > > > justification for a new layout" and yet...
> > > >
> > > > I think we are seeing the start (I hope) of a trend where Arrow is
> not
> > > just
> > > > used "between systems" (e.g. to shuttle data from one place to
> another,
> > > or
> > > > between a query engine and a visualization tool) but also "within
> > > systems"
> > > > (e.g. UDFs, bespoke file formats and temporary tables, between
> workers
> > > in a
> > > > distributed query engine).  When arrow is used "within systems" I
> think
> > > > both the number of bespoke formats and the significance of conversion
> > > cost
> > > > increases.  For example, it's easy to say that Velox should convert
> at
> > > the
> > > > boundary as data leaves Velox.  But what if Velox (or datafusion or
> > ...)
> > > > were to define an interface for UDFs.  Would we want to use Arrow
> there
> > > > (e.g. the C data interface is a good fit)?  If so, wouldn't the
> > > conversion
> > > > cost be more significant?
> > > >
> > > > > Also, I'm very lukewarm towards the concept of "alternative
> layouts"
> > > > > suggested somewhere else in this thread. It does not seem a good
> > choice
> > > > > to complexify the Arrow format that much.
> > > >
> > > > I think, in my opinion, this depends on how many of these alternative
> > > > layouts exist.  If there are just a few, then I agree, we should just
> > > adopt
> > > > them as formal first-class layouts.  If there are many, then I think
> it
> > > > will be too much complexity in Arrow to have all the different
> choices.
> > > > Or, we could say there are many, but the alternatives don't belong in
> > > Arrow
> > > > at all.  In that case I think it's the same question as the above
> > > > paragraph, "do we want Arrow to be used within systems?  Or just
> > between
> > > > systems?"
> > > >
> > > >
> > > > On Wed, Jun 14, 2023 at 2:07 AM Antoine Pitrou <[email protected]>
> > > wrote:
> > > >
> > > > >
> > > > > I agree that ListView cannot be an extension type, given that it
> > > > > features a new layout, and therefore cannot reasonably be backed by
> > an
> > > > > existing storage type (AFAICT).
> > > > >
> > > > > Also, I'm very lukewarm towards the concept of "alternative
> layouts"
> > > > > suggested somewhere else in this thread. It does not seem a good
> > choice
> > > > > to complexify the Arrow format that much.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > Le 07/06/2023 à 00:21, Felipe Oliveira Carvalho a écrit :
> > > > > > +1 on what Ian said.
> > > > > >
> > > > > > And as I write kernels for this new format, I’m learning that
> it’s
> > > > > possible
> > > > > > to re-use the common infrastructure used by List and LargeList to
> > > > > implement
> > > > > > the ListView related features with some adjustments.
> > > > > >
> > > > > > IMO having this format as a second-class citizen would more
> likely
> > > > > > complicate things because it would make this unification harder.
> > > > > >
> > > > > > —
> > > > > > Felipe
> > > > > >
> > > > > > On Tue, 6 Jun 2023 at 18:45 Ian Cook <[email protected]>
> wrote:
> > > > > >
> > > > > >> To clarify why we cannot simply propose adding ListView as a new
> > > > > >> “canonical extension type”: The extension type mechanism in
> Arrow
> > > > > >> depends on the underlying data being organized in an existing
> > Arrow
> > > > > >> layout—that way an implementation that does not support the
> > > extension
> > > > > >> type can still handle the underlying data. But ListView is a
> > wholly
> > > > > >> new layout.
> > > > > >>
> > > > > >> I strongly agree with Weston’s idea that it is a good time for
> > Arrow
> > > > > >> to introduce the notion of “canonical alternative layouts.”
> > > > > >>
> > > > > >> Taken together, I think that canonical extension types and
> > canonical
> > > > > >> alternative layouts could serve as an “incubator” for proposed
> new
> > > > > >> representations. For example, if a proposed canonical
> alternative
> > > > > >> layout ends up being broadly adopted, then that will serve as a
> > > signal
> > > > > >> that we should consider adding it as a primary layout in the
> core
> > > > > >> spec.
> > > > > >>
> > > > > >> It seems to me that most projects that are implementing Arrow
> > today
> > > > > >> are not aiming to provide complete coverage of Arrow; rather
> they
> > > are
> > > > > >> adopting Arrow because of its role as a standard and they are
> > > > > >> implementing only as much of the Arrow standard as they require
> to
> > > > > >> achieve some goal. I believe that such projects are important
> > Arrow
> > > > > >> stakeholders, and I believe that this proposed notion of
> canonical
> > > > > >> alternative layouts will serve them well and will create
> > > efficiencies
> > > > > >> by standardizing implementations around a shared set of
> > > alternatives.
> > > > > >>
> > > > > >> However I think that the documentation for canonical alternative
> > > > > >> layouts should strongly encourage implementers to default to
> using
> > > the
> > > > > >> primary layouts defined in the core spec and only use
> alternative
> > > > > >> layouts in cases where the primary layouts do not meet their
> > needs.
> > > > > >>
> > > > > >>
> > > > > >> On Sat, May 27, 2023 at 7:44 PM Micah Kornfield <
> > > > [email protected]>
> > > > > >> wrote:
> > > > > >>>
> > > > > >>> This sounds reasonable to me but my main concern is, I'm not
> sure
> > > > there
> > > > > >> is
> > > > > >>> a great mechanism to enforce canonical layouts don't somehow
> > become
> > > > > >> default
> > > > > >>> (or the only implementation).
> > > > > >>>
> > > > > >>> Even for these new layouts, I think it might be worth
> rethinking
> > > > > binding
> > > > > >> a
> > > > > >>> layout into the schema versus having a different concept of
> > > encoding
> > > > > (and
> > > > > >>> changing some of the corresponding data structures).
> > > > > >>>
> > > > > >>>
> > > > > >>> On Mon, May 22, 2023 at 10:37 AM Weston Pace <
> > > [email protected]>
> > > > > >> wrote:
> > > > > >>>
> > > > > >>>> Trying to settle on one option is a fruitless endeavor.  Each
> > type
> > > > has
> > > > > >> pros
> > > > > >>>> and cons.  I would also predict that the largest existing
> usage
> > of
> > > > > >> Arrow is
> > > > > >>>> shuttling data from one system to another.  The newly proposed
> > > > format
> > > > > >>>> doesn't appear to have any significant advantage for that use
> > case
> > > > (if
> > > > > >>>> anything, the existing format is arguably better as it is more
> > > > > >> compact).
> > > > > >>>>
> > > > > >>>> I am very biased towards historical precedent and avoiding
> > > breaking
> > > > > >>>> changes.
> > > > > >>>>
> > > > > >>>> We have "canonical extension types", perhaps it is time for
> > > > "canonical
> > > > > >>>> alternative layouts".  We could define it as such:
> > > > > >>>>
> > > > > >>>>   * There are one or more primary layouts
> > > > > >>>>     * Existing layouts are automatically considered primary
> > > layouts,
> > > > > >> even if
> > > > > >>>> they wouldn't
> > > > > >>>>       have been primary layouts initially (e.g. large list)
> > > > > >>>>   * A new layout, if it is semantically equivalent to another,
> > is
> > > > > >> considered
> > > > > >>>> an alternative layout
> > > > > >>>>   * An alternative layout still has the same requirements for
> > > > adoption
> > > > > >> (two
> > > > > >>>> implementations and a vote)
> > > > > >>>>     * An implementation should not feel pressured to rush and
> > > > > implement
> > > > > >> the
> > > > > >>>> new layout.
> > > > > >>>>       It would be good if they contribute in the discussion
> and
> > > > > >> consider the
> > > > > >>>> layout and vote
> > > > > >>>>       if they feel it would be an acceptable design.
> > > > > >>>>   * We can define and vote and approve as many canonical
> > > alternative
> > > > > >> layouts
> > > > > >>>> as we want:
> > > > > >>>>     * A canonical alternative layout should, at a minimum,
> have
> > > some
> > > > > >>>>       reasonable justification, such as improved performance
> for
> > > > > >> algorithm X
> > > > > >>>>   * Arrow implementations MUST support the primary layouts
> > > > > >>>>   * An Arrow implementation MAY support a canonical
> alternative,
> > > > > >> however:
> > > > > >>>>     * An Arrow implementation MUST first support the primary
> > > layout
> > > > > >>>>     * An Arrow implementation MUST support conversion to/from
> > the
> > > > > >> primary
> > > > > >>>> and canonical layout
> > > > > >>>>     * An Arrow implementation's APIs MUST only provide data in
> > the
> > > > > >>>> alternative
> > > > > >>>>       layout if it is explicitly asked for (e.g. schema
> > inference
> > > > > should
> > > > > >>>> prefer the primary layout).
> > > > > >>>>   * We can still vote for new primary layouts (e.g. promoting
> a
> > > > > >> canonical
> > > > > >>>> alternative) but, in these
> > > > > >>>>      votes we don't only consider the value (e.g. performance)
> > of
> > > > the
> > > > > >> layout
> > > > > >>>> but also the interoperability.
> > > > > >>>>      In other words, a layout can only become a primary layout
> > if
> > > > > there
> > > > > >> is
> > > > > >>>> significant evidence that most
> > > > > >>>>      implementations plan to adopt it.
> > > > > >>>>
> > > > > >>>> This lets us evolve support for new layouts more naturally.
> We
> > > can
> > > > > >>>> generally assume that users will not, initially, be aware of
> > these
> > > > > >>>> alternative layouts.  However, everything will just work.
> They
> > > may
> > > > > >> start
> > > > > >>>> to see a performance penalty stemming from a lack of support
> for
> > > > these
> > > > > >>>> layouts.  If this performance penalty becomes significant then
> > > they
> > > > > >> will
> > > > > >>>> discover it and become aware of the problem.  They can then
> ask
> > > > > >> whatever
> > > > > >>>> library they are using to add support for the alternative
> > layout.
> > > > As
> > > > > >>>> enough users find a need for it then libraries will add
> support.
> > > > > >>>> Eventually, enough libraries will support it that we can adopt
> > it
> > > > as a
> > > > > >>>> primary layout.
> > > > > >>>>
> > > > > >>>> Also, it allows libraries to adopt alternative layouts more
> > > > > >> aggressively if
> > > > > >>>> they would like while still hopefully ensuring that we
> > eventually
> > > > all
> > > > > >>>> converge on the same implementation of the alternative layout.
> > > > > >>>>
> > > > > >>>> On Mon, May 22, 2023 at 9:35 AM Will Jones <
> > > [email protected]
> > > > >
> > > > > >>>> wrote:
> > > > > >>>>
> > > > > >>>>> Hello Arrow devs,
> > > > > >>>>>
> > > > > >>>>> I don't understand why we would start deprecating features in
> > the
> > > > > >> Arrow
> > > > > >>>>>> format. Even starting this talk might already be a bad idea
> > > > > >> PR-wise.
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>> I agree we don't want to make breaking changes to the Arrow
> > > format.
> > > > > >> But
> > > > > >>>>> several maintainers have already stated they have no interest
> > in
> > > > > >>>>> maintaining both list types with full compute functionality
> > > [1][2],
> > > > > >> so I
> > > > > >>>>> think it's very likely one list type or the other will be
> > > > > >>>>> implicitly preferred in the ecosystem if this data type was
> > > added.
> > > > > >> If
> > > > > >>>>> that's the case, I'd prefer that we agreed as a community
> which
> > > one
> > > > > >>>> should
> > > > > >>>>> be preferred. Maybe that's not the best path; it's just one
> way
> > > for
> > > > > >> us to
> > > > > >>>>> balance stability, maintenance burden, and relevance.
> > > > > >>>>>
> > > > > >>>>> Can someone help distill down the primary rationale and
> usecase
> > > for
> > > > > >>>>>> adding ArrayView to the Arrow Spec?
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>> Looking back at that old thread, I think one of the main
> > > > motivations
> > > > > >> is
> > > > > >>>> to
> > > > > >>>>> try to prevent query engine implementers from feeling there
> is
> > a
> > > > > >> tradeoff
> > > > > >>>>> between having state-of-the-art performance and being
> > > Arrow-native.
> > > > > >> For
> > > > > >>>>> some of the new array types, we had both Velox and DuckDB use
> > > them,
> > > > > >> so it
> > > > > >>>>> was reasonable to expect they were innovations that might
> > > > > >> proliferate.
> > > > > >>>> I'm
> > > > > >>>>> not sure if the ArrayView is part of that. From Wes earlier
> > [3]:
> > > > > >>>>>
> > > > > >>>>> The idea is that in a world of data and query federation (for
> > > > > >> example,
> > > > > >>>>>> consider [1] where Arrow is being used as a data federation
> > > layer
> > > > > >> with
> > > > > >>>>> many
> > > > > >>>>>> query engines), we want to increase the amount of data
> > in-flight
> > > > > >> and
> > > > > >>>>>> in-memory that is in Arrow format. So if query engines are
> > > having
> > > > > >> to
> > > > > >>>>> depart
> > > > > >>>>>> substantially from the Arrow format to get performance, then
> > > this
> > > > > >>>>> creates a
> > > > > >>>>>> potential lose-lose situation: * Depart from Arrow: get
> better
> > > > > >>>>> performance
> > > > > >>>>>> but pay serialization costs to read and write Arrow (the
> > > > > >> performance
> > > > > >>>> and
> > > > > >>>>>> resource utilization benefits outweigh the serialization
> > costs).
> > > > > >> This
> > > > > >>>>> puts
> > > > > >>>>>> additional pressure on query engines to build specialized
> > > > > >> components
> > > > > >>>> for
> > > > > >>>>>> solving problems rather than making use of off-the-shelf
> > > > components
> > > > > >>>> that
> > > > > >>>>>> use Arrow. This has knock-on effects on ecosystem
> > > fragmentation. *
> > > > > >> Or
> > > > > >>>> use
> > > > > >>>>>> Arrow, and accept suboptimal query processing performance
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> Will mentions one usecase is Velox consuming python UDF
> output,
> > > > which
> > > > > >>>> seems
> > > > > >>>>>> to be mostly about how fast Velox can consume this format,
> not
> > > how
> > > > > >> fast
> > > > > >>>>> it
> > > > > >>>>>> can be written. Are there other usecases?
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>> To be clear, I don't know if that's the use case they want.
> > > That's
> > > > > >> just
> > > > > >>>> me
> > > > > >>>>> speculating.
> > > > > >>>>>
> > > > > >>>>> I still have some questions myself:
> > > > > >>>>>
> > > > > >>>>> 1. Is this array type currently only used in Velox? (not
> DuckDB
> > > > like
> > > > > >> some
> > > > > >>>>> of the other new types?) What evidence do we have that it
> will
> > > > become
> > > > > >>>> used
> > > > > >>>>> outside of Velox?
> > > > > >>>>> 2. We already have three list types: list, large list (64-bit
> > > > > >> offsets),
> > > > > >>>> and
> > > > > >>>>> fixed size list. Do we think we will only want a view version
> > of
> > > > the
> > > > > >>>> 32-bit
> > > > > >>>>> offset variable length list? Or are we potentially talking
> > about
> > > > view
> > > > > >>>>> variants for all three?
> > > > > >>>>>
> > > > > >>>>> Best,
> > > > > >>>>>
> > > > > >>>>> Will Jones
> > > > > >>>>>
> > > > > >>>>> [1]
> > > > https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx
> > > > > >>>>> [2]
> > > > https://lists.apache.org/thread/cc4w3vs3foj1fmpq9x888k51so60ftr3
> > > > > >>>>> [3]
> > > > https://lists.apache.org/thread/mk2yn62y6l8qtngcs1vg2qtwlxzbrt8t
> > > > > >>>>>
> > > > > >>>>> On Mon, May 22, 2023 at 3:48 AM Andrew Lamb <
> > > [email protected]>
> > > > > >>>> wrote:
> > > > > >>>>>
> > > > > >>>>>> Can someone help distill down the primary rationale and
> > usecase
> > > > for
> > > > > >>>>>> adding ArrayView to the Arrow Spec?
> > > > > >>>>>>
> > > > > >>>>>>  From the above discussions, the stated rationale seems to
> be
> > > fast
> > > > > >>>>>> (zero-copy) interchange with Velox.
> > > > > >>>>>>
> > > > > >>>>>> This thread has qualitatively enumerated the benefits of
> > > > > >> (offset+len)
> > > > > >>>>>> encoding over the existing Arrow ListArray (offets)
> approach,
> > > but
> > > > I
> > > > > >>>>> haven't
> > > > > >>>>>> seen any performance measurements that might help us to
> gauge
> > > the
> > > > > >>>>> tradeoff
> > > > > >>>>>> in additional complexity vs runtime overhead.
> > > > > >>>>>>
> > > > > >>>>>> Will mentions one usecase is Velox consuming python UDF
> > output,
> > > > > >> which
> > > > > >>>>> seems
> > > > > >>>>>> to be mostly about how fast Velox can consume this format,
> not
> > > how
> > > > > >> fast
> > > > > >>>>> it
> > > > > >>>>>> can be written. Are there other usecases?
> > > > > >>>>>>
> > > > > >>>>>> Do we have numbers showing how much overhead converting to
> > /from
> > > > > >>>> Velox's
> > > > > >>>>>> internal representation and the existing ListArray adds? Has
> > > > > >> anyone in
> > > > > >>>>>> Velox land considered adding faster support for Arrow style
> > > > > >> ListArray
> > > > > >>>>>> encoding?
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>> Andrew
> > > > > >>>>>>
> > > > > >>>>>> On Mon, May 22, 2023 at 4:38 AM Antoine Pitrou <
> > > > [email protected]
> > > > > >>>
> > > > > >>>>> wrote:
> > > > > >>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> Hi,
> > > > > >>>>>>>
> > > > > >>>>>>> I don't understand why we would start deprecating features
> in
> > > the
> > > > > >>>> Arrow
> > > > > >>>>>>> format. Even starting this talk might already be a bad idea
> > > > > >> PR-wise.
> > > > > >>>>>>>
> > > > > >>>>>>> As for implementing conversions at the I/O boundary, it's a
> > > > > >>>> reasonably
> > > > > >>>>>>> policy, but it still requires work by implementors and it's
> > not
> > > > > >>>> granted
> > > > > >>>>>>> that all consumers of the Arrow format will grow such
> > > conversions
> > > > > >>>>>>> if/when we add non-trivial types such as ListView or
> > > StringView.
> > > > > >>>>>>>
> > > > > >>>>>>> Regards
> > > > > >>>>>>>
> > > > > >>>>>>> Antoine.
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> Le 22/05/2023 à 00:39, Will Jones a écrit :
> > > > > >>>>>>>> One more thing: Looking back on the previous discussion[1]
> > > > > >> (which
> > > > > >>>>>> Weston
> > > > > >>>>>>>> pointed out in his earlier message), Jorge suggested that
> > the
> > > > > >> old
> > > > > >>>>> list
> > > > > >>>>>>>> types might be deprecated in favor of view variants [2].
> > > Others
> > > > > >>>> were
> > > > > >>>>>>>> worried that it might undermine the perception that the
> > Arrow
> > > > > >>>> format
> > > > > >>>>> is
> > > > > >>>>>>>> stable. I think it might be worth thinking about "soft
> > > > > >> deprecating"
> > > > > >>>>> the
> > > > > >>>>>>> old
> > > > > >>>>>>>> list type: suggesting new implementations prefer the list
> > > > > >> view, but
> > > > > >>>>>>>> reassuring that implementations should support the old
> > format,
> > > > > >> even
> > > > > >>>>> if
> > > > > >>>>>>> they
> > > > > >>>>>>>> just convert to the new format. To be clear, this wouldn't
> > > > > >> mean we
> > > > > >>>>> plan
> > > > > >>>>>>> to
> > > > > >>>>>>>> create breaking changes in the format; but if we ever did
> > for
> > > > > >> other
> > > > > >>>>>>>> reasons, the old list type might go.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Arrow compute libraries could choose either format for
> > compute
> > > > > >>>>> support,
> > > > > >>>>>>> and
> > > > > >>>>>>>> plan to do conversion at the boundaries. Libraries that
> use
> > > > > >> the new
> > > > > >>>>>> type
> > > > > >>>>>>>> will have cheap conversion when reading the old type.
> > > Meanwhile
> > > > > >>>> those
> > > > > >>>>>>> that
> > > > > >>>>>>>> are still on the old type will have some incentive to move
> > > > > >> towards
> > > > > >>>>> the
> > > > > >>>>>>> new
> > > > > >>>>>>>> one, since that conversion will not be as efficient.
> > > > > >>>>>>>>
> > > > > >>>>>>>> [1]
> > > > > >>>>
> > https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > > > > >>>>>>>> [2]
> > > > > >>>>
> > https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Sun, May 21, 2023 at 3:07 PM Will Jones <
> > > > > >>>> [email protected]>
> > > > > >>>>>>> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> Hello,
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I think Sasha brings up a good point, that the advantages
> > of
> > > > > >> this
> > > > > >>>>>> format
> > > > > >>>>>>>>> seem to be primarily about query processing. Other
> > encodings
> > > > > >> like
> > > > > >>>>> REE
> > > > > >>>>>>> and
> > > > > >>>>>>>>> dictionary have space-saving advantages that justify them
> > > > > >> simply
> > > > > >>>> in
> > > > > >>>>>>> terms
> > > > > >>>>>>>>> of space efficiency (although they have query processing
> > > > > >>>> advantages
> > > > > >>>>> as
> > > > > >>>>>>>>> well). As discussed, most use cases are already well
> served
> > > by
> > > > > >>>>>> existing
> > > > > >>>>>>>>> list types and dictionary encoding.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I agree that there are cases where transferring this type
> > > > > >> without
> > > > > >>>>>>>>> conversion would be ideal. One use case I can think of is
> > if
> > > > > >> Velox
> > > > > >>>>>>> wants to
> > > > > >>>>>>>>> be able to take Arrow-based UDFs (possibly written with
> > > > > >> PyArrow,
> > > > > >>>> for
> > > > > >>>>>>>>> example) that operate on this column type and therefore
> > wants
> > > > > >>>>>> zero-copy
> > > > > >>>>>>>>> exchange over the C Data Interface.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> One big question I have: we already have three list
> types:
> > > > > >> list,
> > > > > >>>>> large
> > > > > >>>>>>>>> list (64-bit offsets), and fixed size list. Do we think
> we
> > > > > >> will
> > > > > >>>> only
> > > > > >>>>>>> want a
> > > > > >>>>>>>>> view version of the 32-bit offset variable length list?
> Or
> > > > > >> are we
> > > > > >>>>>>>>> potentially talking about view variants for all three?
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Best,
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Will Jones
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On Sun, May 21, 2023 at 2:19 PM Felipe Oliveira Carvalho
> <
> > > > > >>>>>>>>> [email protected]> wrote:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>> The benefit of having a memory format that’s friendly to
> > > > > >>>>>>> non-deterministic
> > > > > >>>>>>>>>> order writes is unlocked by the transport and processing
> > of
> > > > > >> the
> > > > > >>>>> data
> > > > > >>>>>>> being
> > > > > >>>>>>>>>> agnostic to the physical order as much as possible.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Requiring a conversion could cancel out that benefit.
> But
> > it
> > > > > >> can
> > > > > >>>>> be a
> > > > > >>>>>>>>>> provisory step for compatibility between systems that
> > don’t
> > > > > >>>>>> understand
> > > > > >>>>>>> the
> > > > > >>>>>>>>>> format yet. This is similar to the situation with
> > > compression
> > > > > >>>>> schemes
> > > > > >>>>>>> like
> > > > > >>>>>>>>>> run-end encoding — the goal is processing the compressed
> > > data
> > > > > >>>>>> directly
> > > > > >>>>>>>>>> without an expansion step whenever possible.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> This is why having it as part of the open Arrow format
> is
> > so
> > > > > >>>>>> important:
> > > > > >>>>>>>>>> everyone can agree on a format that’s friendly to
> parallel
> > > > > >> and/or
> > > > > >>>>>>>>>> vectorized compute kernels without introducing multiple
> > > > > >>>>> incompatible
> > > > > >>>>>>>>>> formats to the ecosystem and without imposing a
> conversion
> > > > > >> step
> > > > > >>>>>> between
> > > > > >>>>>>>>>> the
> > > > > >>>>>>>>>> different systems.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> —
> > > > > >>>>>>>>>> Felipe
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> On Sat, 20 May 2023 at 20:04 Aldrin
> > > > > >> <[email protected]>
> > > > > >>>>>>> wrote:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> I don't feel like this representation is necessarily a
> > > > > >> detail of
> > > > > >>>>> the
> > > > > >>>>>>>>>> query
> > > > > >>>>>>>>>>> engine, but I am also not sure why this representation
> > > would
> > > > > >>>> have
> > > > > >>>>> to
> > > > > >>>>>>> be
> > > > > >>>>>>>>>>> converted to a non-view format when serializing. Could
> > you
> > > > > >>>> clarify
> > > > > >>>>>>>>>> that? My
> > > > > >>>>>>>>>>> impression is that this representation could be used
> for
> > > > > >>>>> persistence
> > > > > >>>>>>> or
> > > > > >>>>>>>>>>> data transfer, though it can be more complex to
> guarantee
> > > > > >> the
> > > > > >>>>>> portion
> > > > > >>>>>>> of
> > > > > >>>>>>>>>>> the buffer that an index points to is also present in
> > > > > >> memory.
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Sent from Proton Mail for iOS
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <
> > > > > >>>>>>>>>> [email protected]
> > > > > >>>>>>>>>>>
> > > > > >> <On+Sat,+May+20,+2023+at+15:00,+Sasha+Krassovsky+%3C%3Ca+href=>>
> > > > > >>>>>>> wrote:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Hi everyone,
> > > > > >>>>>>>>>>> I understand that there are numerous benefits to this
> > > > > >>>>> representation
> > > > > >>>>>>>>>>> during query processing, but would it be fair to say
> that
> > > > > >> this
> > > > > >>>> is
> > > > > >>>>> an
> > > > > >>>>>>>>>>> implementation detail of the query engine? Query
> engines
> > > > > >> don’t
> > > > > >>>>>>>>>> necessarily
> > > > > >>>>>>>>>>> need to conform to the Arrow format internally, only at
> > > > > >>>>>> ingest/egress
> > > > > >>>>>>>>>>> points, and performing a conversion from the non-view
> to
> > > > > >> view
> > > > > >>>>> format
> > > > > >>>>>>>>>> seems
> > > > > >>>>>>>>>>> like it would be very cheap (though I understand not
> > > > > >> necessarily
> > > > > >>>>> the
> > > > > >>>>>>>>>> other
> > > > > >>>>>>>>>>> way around, but you’d need to do that anyway if you’re
> > > > > >>>>> serializing).
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Sasha Krassovsky
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>> 20 мая 2023 г., в 13:00, Will Jones <
> > > > > >> [email protected]>
> > > > > >>>>>>>>>>> написал(а):
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Thanks for sharing these details, Pedro. The
> > conditional
> > > > > >>>>> branches
> > > > > >>>>>>>>>>> argument
> > > > > >>>>>>>>>>>> makes a lot of sense to me.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> The tensors point brings up some interesting issues.
> For
> > > > > >> now,
> > > > > >>>>> we've
> > > > > >>>>>>>>>>> defined
> > > > > >>>>>>>>>>>> our only tensor extension type to be built on a fixed
> > size
> > > > > >>>> list.
> > > > > >>>>>> If a
> > > > > >>>>>>>>>> use
> > > > > >>>>>>>>>>>> case of this might be manipulating tensors with zero
> > copy,
> > > > > >>>>> perhaps
> > > > > >>>>>>>>>> that
> > > > > >>>>>>>>>>>> suggests that we want a fixed size list variant? In
> > > > > >> addition,
> > > > > >>>>> would
> > > > > >>>>>>> we
> > > > > >>>>>>>>>>> have
> > > > > >>>>>>>>>>>> to define another extension type to be a ListView
> > variant?
> > > > > >> Or
> > > > > >>>>> would
> > > > > >>>>>>> we
> > > > > >>>>>>>>>>> want
> > > > > >>>>>>>>>>>> to think about making extension types somehow valid
> > across
> > > > > >>>>> various
> > > > > >>>>>>>>>>>> encodings of the same "logical type"?
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha
> > > > > >> Pedreira
> > > > > >>>>>>>>>>>>> <[email protected]> wrote:
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Hi all,
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> This is Pedro from the Velox team at Meta. This is my
> > > > > >> first
> > > > > >>>> time
> > > > > >>>>>>>>>> here,
> > > > > >>>>>>>>>>> so
> > > > > >>>>>>>>>>>>> nice to e-meet you!
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Adding to what Felipe said, the main reason we
> created
> > > > > >>>>> “ListView”
> > > > > >>>>>>>>>>> (though
> > > > > >>>>>>>>>>>>> we just call them ArrayVector/MapVector in Velox) is
> > > that,
> > > > > >>>> along
> > > > > >>>>>>> with
> > > > > >>>>>>>>>>>>> StringViews for strings, they allow us to write any
> > > > > >> columnar
> > > > > >>>>>> buffer
> > > > > >>>>>>>>>>>>> out-or-order, regardless of their types or encodings.
> > > > > >> This is
> > > > > >>>>>>>>>> naturally
> > > > > >>>>>>>>>>>>> doable for all primitive types (fixed-size), but not
> > for
> > > > > >> types
> > > > > >>>>>> that
> > > > > >>>>>>>>>>> don’t
> > > > > >>>>>>>>>>>>> have fixed size and are required to be contiguous.
> The
> > > > > >>>>> StringView
> > > > > >>>>>>> and
> > > > > >>>>>>>>>>>>> ListView formats allow us to keep this invariant in
> > > Velox.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Being able to write vectors out-of-order is useful
> when
> > > > > >>>>> executing
> > > > > >>>>>>>>>>>>> conditionals like IF/SWITCH statements, which are
> > > > > >> pervasive
> > > > > >>>>> among
> > > > > >>>>>>> our
> > > > > >>>>>>>>>>>>> workloads. To fully vectorize it, one first evaluates
> > the
> > > > > >>>>>>> expression,
> > > > > >>>>>>>>>>> then
> > > > > >>>>>>>>>>>>> generate a bitmap containing which rows take the THEN
> > and
> > > > > >>>> which
> > > > > >>>>>> take
> > > > > >>>>>>>>>> the
> > > > > >>>>>>>>>>>>> ELSE branch. Then you populate all rows that match
> the
> > > > > >> first
> > > > > >>>>>> branch
> > > > > >>>>>>>>>> by
> > > > > >>>>>>>>>>>>> evaluating the THEN expression in a vectorized
> > > > > >> (branch-less
> > > > > >>>> and
> > > > > >>>>>>> cache
> > > > > >>>>>>>>>>>>> friendly) way, and subsequently the ELSE branch. If
> you
> > > > > >> can’t
> > > > > >>>>>> write
> > > > > >>>>>>>>>> them
> > > > > >>>>>>>>>>>>> out-of-order, you would either have a big branch per
> > row
> > > > > >>>>>> dispatching
> > > > > >>>>>>>>>> to
> > > > > >>>>>>>>>>> the
> > > > > >>>>>>>>>>>>> right expression (slow), or populate two distinct
> > vectors
> > > > > >> then
> > > > > >>>>>>>>>> merging
> > > > > >>>>>>>>>>> them
> > > > > >>>>>>>>>>>>> at the end (potentially even slower). How much faster
> > our
> > > > > >>>>> approach
> > > > > >>>>>>> is
> > > > > >>>>>>>>>>>>> highly depends on the buffer sizes and expressions,
> but
> > > we
> > > > > >>>> found
> > > > > >>>>>> it
> > > > > >>>>>>>>>> to
> > > > > >>>>>>>>>>> be
> > > > > >>>>>>>>>>>>> faster enough on average to justify us extending the
> > > > > >>>> underlying
> > > > > >>>>>>>>>> layout.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> With that said, this is all within a single thread of
> > > > > >>>> execution.
> > > > > >>>>>>>>>>>>> Parallelization is done by feeding each thread its
> own
> > > > > >>>>>> vector/data.
> > > > > >>>>>>>>>> As
> > > > > >>>>>>>>>>>>> pointed out in a previous message, this also gives
> you
> > > the
> > > > > >>>>>>>>>> flexibility
> > > > > >>>>>>>>>>> to
> > > > > >>>>>>>>>>>>> implement cardinality increasing/reducing operations,
> > but
> > > > > >> we
> > > > > >>>>> don’t
> > > > > >>>>>>>>>> use
> > > > > >>>>>>>>>>> it
> > > > > >>>>>>>>>>>>> for that purpose. Operations like filtering, joining,
> > > > > >>>> unnesting
> > > > > >>>>>> and
> > > > > >>>>>>>>>>> similar
> > > > > >>>>>>>>>>>>> are done by wrapping the internal vector in a
> > dictionary,
> > > > > >> as
> > > > > >>>>> these
> > > > > >>>>>>>>>> need
> > > > > >>>>>>>>>>> to
> > > > > >>>>>>>>>>>>> work not only on “ListViews” but on any data types
> with
> > > > > >> any
> > > > > >>>>>>> encoding.
> > > > > >>>>>>>>>>> There
> > > > > >>>>>>>>>>>>> are more details on Section 4.2.1 in [1]
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Beyond this, it also gives function/kernel developers
> > > more
> > > > > >>>>>>>>>> flexibility
> > > > > >>>>>>>>>>> to
> > > > > >>>>>>>>>>>>> implement operations that manipulate Arrays/Maps. For
> > > > > >> example,
> > > > > >>>>>>>>>>> operations
> > > > > >>>>>>>>>>>>> that slice these containers can be implemented in a
> > > > > >> zero-copy
> > > > > >>>>>> manner
> > > > > >>>>>>>>>> by
> > > > > >>>>>>>>>>>>> just rearranging the lengths/offsets indices, without
> > > ever
> > > > > >>>>>> touching
> > > > > >>>>>>>>>> the
> > > > > >>>>>>>>>>>>> larger internal buffers. This is a similar motivation
> > as
> > > > > >> for
> > > > > >>>>>>>>>> StringView
> > > > > >>>>>>>>>>>>> (think of substr(), trim(), and similar). One nice
> last
> > > > > >>>> property
> > > > > >>>>>> is
> > > > > >>>>>>>>>> that
> > > > > >>>>>>>>>>>>> this layout allows for overlapping ranges. This is
> > > > > >> something
> > > > > >>>>>>>>>> discussed
> > > > > >>>>>>>>>>> with
> > > > > >>>>>>>>>>>>> our ML people to allow deduping feature values in a
> > > tensor
> > > > > >>>>> (which
> > > > > >>>>>> is
> > > > > >>>>>>>>>>> fairly
> > > > > >>>>>>>>>>>>> common), but not something we have leveraged just
> yet.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> [1] -
> https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Best,
> > > > > >>>>>>>>>>>>> --
> > > > > >>>>>>>>>>>>> Pedro Pedreira
> > > > > >>>>>>>>>>>>> ________________________________
> > > > > >>>>>>>>>>>>> From: Felipe Oliveira Carvalho <[email protected]>
> > > > > >>>>>>>>>>>>> Sent: Friday, May 19, 2023 10:01 AM
> > > > > >>>>>>>>>>>>> To: [email protected] <[email protected]>
> > > > > >>>>>>>>>>>>> Cc: Pedro Eugenio Rocha Pedreira <[email protected]>
> > > > > >>>>>>>>>>>>> Subject: Re: [DISCUSS][Format] Starting the draft
> > > > > >>>> implementation
> > > > > >>>>>> of
> > > > > >>>>>>>>>> the
> > > > > >>>>>>>>>>>>> ArrayView array format
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> +pedroerp On Thu, 11 May 2023 at 17: 51 Raphael
> > > > > >> Taylor-Davies
> > > > > >>>>> <r.
> > > > > >>>>>>>>>>>>> taylordavies@ googlemail. com. invalid> wrote: Hi
> > All, >
> > > > > >> if
> > > > > >>>> we
> > > > > >>>>>>> added
> > > > > >>>>>>>>>>>>> this, do we think many Arrow and query > engine
> > > > > >>>> implementations
> > > > > >>>>>> (for
> > > > > >>>>>>>>>>>>> example, DataFusion) will be
> > > > > >>>>>>>>>>>>> ZjQcmQRYFpfptBannerStart
> > > > > >>>>>>>>>>>>> This Message Is From an External Sender
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> ZjQcmQRYFpfptBannerEnd
> > > > > >>>>>>>>>>>>> +pedroerp
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies
> > > > > >>>>>>>>>>>>> <[email protected]> wrote:
> > > > > >>>>>>>>>>>>> Hi All,
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> if we added this, do we think many Arrow and query
> > > > > >>>>>>>>>>>>>> engine implementations (for example, DataFusion)
> will
> > be
> > > > > >>>> eager
> > > > > >>>>> to
> > > > > >>>>>>>>>> add
> > > > > >>>>>>>>>>>>> full
> > > > > >>>>>>>>>>>>>> support for the type, including compute kernels? Or
> > are
> > > > > >> they
> > > > > >>>>>> likely
> > > > > >>>>>>>>>> to
> > > > > >>>>>>>>>>>>> just
> > > > > >>>>>>>>>>>>>> convert this type to ListArray at import boundaries?
> > > > > >>>>>>>>>>>>> I can't speak for query engines in general, but at
> > least
> > > > > >> for
> > > > > >>>>>>> arrow-rs
> > > > > >>>>>>>>>>>>> and by extension DataFusion, and based on my current
> > > > > >>>>> understanding
> > > > > >>>>>>> of
> > > > > >>>>>>>>>>>>> the use-cases I would be rather hesitant to add
> support
> > > > > >> to the
> > > > > >>>>>>>>>> kernels
> > > > > >>>>>>>>>>>>> for this array type, definitely instead favouring
> > > > > >> conversion
> > > > > >>>> at
> > > > > >>>>>> the
> > > > > >>>>>>>>>>>>> edges. We already have issues with the amount of code
> > > > > >>>> generation
> > > > > >>>>>>>>>>>>> resulting in binary bloat and long compile times,
> and I
> > > > > >> worry
> > > > > >>>>> this
> > > > > >>>>>>>>>> would
> > > > > >>>>>>>>>>>>> worsen this situation whilst not really providing
> > > > > >> compelling
> > > > > >>>>>>>>>> advantages
> > > > > >>>>>>>>>>>>> for the vast majority of workloads that don't
> interact
> > > > > >> with
> > > > > >>>>> Velox.
> > > > > >>>>>>>>>>>>> Whilst I can definitely see that the ListView
> > > > > >> representation
> > > > > >>>> is
> > > > > >>>>>>>>>> probably
> > > > > >>>>>>>>>>>>> a better way to represent variable length lists than
> > what
> > > > > >>>> arrow
> > > > > >>>>>>>>>> settled
> > > > > >>>>>>>>>>>>> upon, I'm not yet convinced it is sufficiently better
> > to
> > > > > >>>>>> incentivise
> > > > > >>>>>>>>>>>>> broad ecosystem adoption.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Kind Regards,
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Raphael Taylor-Davies
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> On 11/05/2023 21:20, Will Jones wrote:
> > > > > >>>>>>>>>>>>>> Hi Felipe,
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Thanks for the additional details.
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> Velox kernels benefit from being able to append
> data
> > to
> > > > > >> the
> > > > > >>>>>> array
> > > > > >>>>>>>>>> from
> > > > > >>>>>>>>>>>>>>> different threads without care for strict ordering.
> > > > > >> Only the
> > > > > >>>>>>>>>> offsets
> > > > > >>>>>>>>>>>>> array
> > > > > >>>>>>>>>>>>>>> has to be written according to logical order but
> that
> > > is
> > > > > >>>>>>>>>> potentially a
> > > > > >>>>>>>>>>>>> much
> > > > > >>>>>>>>>>>>>>> smaller buffer than the values buffer.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> It still seems to me like applications are still
> > pretty
> > > > > >>>> niche,
> > > > > >>>>>> as I
> > > > > >>>>>>>>>>>>> suspect
> > > > > >>>>>>>>>>>>>> in most cases the benefits are outweighed by the
> > costs.
> > > > > >> The
> > > > > >>>>>> benefit
> > > > > >>>>>>>>>>> here
> > > > > >>>>>>>>>>>>>> seems pretty limited: if you are trying to split
> work
> > > > > >> between
> > > > > >>>>>>>>>> threads,
> > > > > >>>>>>>>>>>>>> usually you will have other levels such as array
> > chunks
> > > > > >> to
> > > > > >>>>>>>>>> parallelize.
> > > > > >>>>>>>>>>>>> And
> > > > > >>>>>>>>>>>>>> if you have an incoming stream of row data, you'll
> > want
> > > > > >> to
> > > > > >>>>> append
> > > > > >>>>>>> in
> > > > > >>>>>>>>>>>>>> predictable order to match the order of the other
> > > > > >> arrays. Am
> > > > > >>>> I
> > > > > >>>>>>>>>> missing
> > > > > >>>>>>>>>>>>>> something?
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> And, IIUC, the cost of using ListView with
> > out-of-order
> > > > > >>>> values
> > > > > >>>>>> over
> > > > > >>>>>>>>>>>>>> ListArray is you lose memory locality; the values of
> > > > > >> element
> > > > > >>>> 2
> > > > > >>>>>> are
> > > > > >>>>>>>>>> no
> > > > > >>>>>>>>>>>>>> longer adjacent to the values of element 1. What do
> > you
> > > > > >> think
> > > > > >>>>>> about
> > > > > >>>>>>>>>>> that
> > > > > >>>>>>>>>>>>>> tradeoff?
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> I don't mean to be difficult about this. I'm excited
> > for
> > > > > >> both
> > > > > >>>>> the
> > > > > >>>>>>>>>> REE
> > > > > >>>>>>>>>>> and
> > > > > >>>>>>>>>>>>>> StringView arrays, but this one I'm not so sure
> about
> > > > > >> yet. I
> > > > > >>>>>>> suppose
> > > > > >>>>>>>>>>>>> what I
> > > > > >>>>>>>>>>>>>> am trying to ask is, if we added this, do we think
> > many
> > > > > >> Arrow
> > > > > >>>>> and
> > > > > >>>>>>>>>> query
> > > > > >>>>>>>>>>>>>> engine implementations (for example, DataFusion)
> will
> > be
> > > > > >>>> eager
> > > > > >>>>> to
> > > > > >>>>>>>>>> add
> > > > > >>>>>>>>>>>>> full
> > > > > >>>>>>>>>>>>>> support for the type, including compute kernels? Or
> > are
> > > > > >> they
> > > > > >>>>>> likely
> > > > > >>>>>>>>>> to
> > > > > >>>>>>>>>>>>> just
> > > > > >>>>>>>>>>>>>> convert this type to ListArray at import boundaries?
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Because if it turns out to be the latter, then we
> > might
> > > > > >> as
> > > > > >>>> well
> > > > > >>>>>> ask
> > > > > >>>>>>>>>>> Velox
> > > > > >>>>>>>>>>>>>> to export this type as ListArray and save the rest
> of
> > > the
> > > > > >>>>>> ecosystem
> > > > > >>>>>>>>>>> some
> > > > > >>>>>>>>>>>>>> work.
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Best,
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Will Jones
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira
> > > > > >> Carvalho <
> > > > > >>>>>>>>>>>>>> [email protected]<mailto:[email protected]>>
> > wrote:
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> Initial reason for ListView arrays in Arrow is
> > > zero-copy
> > > > > >>>>>>>>>> compatibility
> > > > > >>>>>>>>>>>>> with
> > > > > >>>>>>>>>>>>>>> Velox which uses this format.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> Velox kernels benefit from being able to append
> data
> > to
> > > > > >> the
> > > > > >>>>>> array
> > > > > >>>>>>>>>> from
> > > > > >>>>>>>>>>>>>>> different threads without care for strict ordering.
> > > > > >> Only the
> > > > > >>>>>>>>>> offsets
> > > > > >>>>>>>>>>>>> array
> > > > > >>>>>>>>>>>>>>> has to be written according to logical order but
> that
> > > is
> > > > > >>>>>>>>>> potentially a
> > > > > >>>>>>>>>>>>> much
> > > > > >>>>>>>>>>>>>>> smaller buffer than the values buffer.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> Acero kernels could take advantage of that in the
> > > > > >> future.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> In implementing ListViewArray/Type I was able to
> > reuse
> > > > > >> some
> > > > > >>>>> C++
> > > > > >>>>>>>>>>>>> templates
> > > > > >>>>>>>>>>>>>>> used for ListArray which can reduce some of the
> > burden
> > > > > >> on
> > > > > >>>>> kernel
> > > > > >>>>>>>>>>>>>>> implementations that aim to work with all the
> types.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> I’m can fix Acero kernels for working with
> ListView.
> > > > > >> This is
> > > > > >>>>>>>>>> similar
> > > > > >>>>>>>>>>> to
> > > > > >>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>> work I’ve doing in kernels dealing with run-end
> > encoded
> > > > > >>>>> arrays.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> —
> > > > > >>>>>>>>>>>>>>> Felipe
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> On Wed, 26 Apr 2023 at 01:03 Will Jones <
> > > > > >>>>>> [email protected]
> > > > > >>>>>>>>>>>>> <mailto:[email protected]>> wrote:
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> I suppose one common use case is materializing
> list
> > > > > >> columns
> > > > > >>>>>> after
> > > > > >>>>>>>>>>> some
> > > > > >>>>>>>>>>>>>>>> expanding operation like a join or unnest. That's
> a
> > > > > >> case
> > > > > >>>>> where
> > > > > >>>>>> I
> > > > > >>>>>>>>>>> could
> > > > > >>>>>>>>>>>>>>>> imagine a lot of repetition of values. Haven't yet
> > > > > >> thought
> > > > > >>>> of
> > > > > >>>>>>>>>> common
> > > > > >>>>>>>>>>>>>>> cases
> > > > > >>>>>>>>>>>>>>>> where there is overlap but not full duplication,
> but
> > > am
> > > > > >>>> eager
> > > > > >>>>>> to
> > > > > >>>>>>>>>> hear
> > > > > >>>>>>>>>>>>>>> any.
> > > > > >>>>>>>>>>>>>>>> The dictionary encoding point Raphael makes is
> > > > > >> interesting,
> > > > > >>>>>>>>>>> especially
> > > > > >>>>>>>>>>>>>>>> given the existence of LargeList and
> FixedSizeList.
> > > For
> > > > > >>>> many
> > > > > >>>>>>>>>>>>> operations,
> > > > > >>>>>>>>>>>>>>> it
> > > > > >>>>>>>>>>>>>>>> might make more sense to just compose those
> existing
> > > > > >> types.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> IIUC the operations that would be unique to the
> > > > > >> ArrayView
> > > > > >>>> are
> > > > > >>>>>>> ones
> > > > > >>>>>>>>>>>>>>> altering
> > > > > >>>>>>>>>>>>>>>> the shape. One could truncate each array to a
> > certain
> > > > > >>>> length
> > > > > >>>>>>>>>> cheaply
> > > > > >>>>>>>>>>>>>>> simply
> > > > > >>>>>>>>>>>>>>>> by replacing the sizes buffer. Or perhaps there
> are
> > > > > >>>>> interesting
> > > > > >>>>>>>>>>>>>>> operations
> > > > > >>>>>>>>>>>>>>>> on tensors that would benefit.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 7:47 PM Raphael
> > Taylor-Davies
> > > > > >>>>>>>>>>>>>>>> <[email protected]> wrote:
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Unless I am missing something, I think the
> > selection
> > > > > >>>>> use-case
> > > > > >>>>>>>>>> could
> > > > > >>>>>>>>>>> be
> > > > > >>>>>>>>>>>>>>>>> equally well served by a dictionary-encoded
> > > > > >>>>>>> BinarArray/ListArray,
> > > > > >>>>>>>>>>> and
> > > > > >>>>>>>>>>>>>>>> would
> > > > > >>>>>>>>>>>>>>>>> have the benefit of not requiring any
> modifications
> > > > > >> to the
> > > > > >>>>>>>>>> existing
> > > > > >>>>>>>>>>>>>>>> format
> > > > > >>>>>>>>>>>>>>>>> or kernels.
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> The major additional flexibility of the proposed
> > > > > >> encoding
> > > > > >>>>>> would
> > > > > >>>>>>>>>> be
> > > > > >>>>>>>>>>>>>>>>> permitting disjoint or overlapping ranges, are
> > these
> > > > > >>>> common
> > > > > >>>>>>>>>> enough
> > > > > >>>>>>>>>>> in
> > > > > >>>>>>>>>>>>>>>>> practice to represent a meaningful bottleneck?
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> On 26 April 2023 01:40:14 BST, David Li <
> > > > > >>>>> [email protected]
> > > > > >>>>>>>>>>> <mailto:
> > > > > >>>>>>>>>>>>> [email protected]>> wrote:
> > > > > >>>>>>>>>>>>>>>>>> Is there a need for a 64-bit offsets version the
> > > > > >> same way
> > > > > >>>>> we
> > > > > >>>>>>>>>> have
> > > > > >>>>>>>>>>>>> List
> > > > > >>>>>>>>>>>>>>>>> and LargeList?
> > > > > >>>>>>>>>>>>>>>>>> And just to be clear, the difference with List
> is
> > > > > >> that
> > > > > >>>> the
> > > > > >>>>>>> lists
> > > > > >>>>>>>>>>>>> don't
> > > > > >>>>>>>>>>>>>>>>> have to be stored in their logical order (or in
> > other
> > > > > >>>> words,
> > > > > >>>>>>>>>> offsets
> > > > > >>>>>>>>>>>>> do
> > > > > >>>>>>>>>>>>>>>> not
> > > > > >>>>>>>>>>>>>>>>> have to be nondecreasing and so we also need
> > sizes)?
> > > > > >>>>>>>>>>>>>>>>>> On Wed, Apr 26, 2023, at 09:37, Weston Pace
> wrote:
> > > > > >>>>>>>>>>>>>>>>>>> For context, there was some discussion on this
> > back
> > > > > >> in
> > > > > >>>>> [1].
> > > > > >>>>>> At
> > > > > >>>>>>>>>>> that
> > > > > >>>>>>>>>>>>>>>>> time
> > > > > >>>>>>>>>>>>>>>>>>> this was called "sequence view" but I do not
> like
> > > > > >> that
> > > > > >>>>> name.
> > > > > >>>>>>>>>>>>>>> However,
> > > > > >>>>>>>>>>>>>>>>>>> array-view array is a little confusing. Given
> > this
> > > > > >> is
> > > > > >>>>>> similar
> > > > > >>>>>>>>>> to
> > > > > >>>>>>>>>>>>>>> list
> > > > > >>>>>>>>>>>>>>>>> can
> > > > > >>>>>>>>>>>>>>>>>>> we go with list-view array?
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> Thanks for the introduction. I'd be interested
> > to
> > > > > >> hear
> > > > > >>>>>> about
> > > > > >>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>>>>>> applications Velox has found for these
> vectors,
> > > > > >> and in
> > > > > >>>>> what
> > > > > >>>>>>>>>>>>>>>> situations
> > > > > >>>>>>>>>>>>>>>>>>> they
> > > > > >>>>>>>>>>>>>>>>>>>> are useful. This could be contrasted with the
> > > > > >> current
> > > > > >>>>>>>>>> ListArray
> > > > > >>>>>>>>>>>>>>>>>>>> implementations.
> > > > > >>>>>>>>>>>>>>>>>>> I believe one significant benefit is that take
> > (and
> > > > > >> by
> > > > > >>>>>> proxy,
> > > > > >>>>>>>>>>>>>>> filter)
> > > > > >>>>>>>>>>>>>>>>> and
> > > > > >>>>>>>>>>>>>>>>>>> sort are O(# of items) with the proposed format
> > and
> > > > > >> O(#
> > > > > >>>> of
> > > > > >>>>>>>>>> bytes)
> > > > > >>>>>>>>>>>>>>> with
> > > > > >>>>>>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>>>>> current format. Jorge did some profiling to
> this
> > > > > >> effect
> > > > > >>>> in
> > > > > >>>>>>> [1].
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> [1]
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>
> > > https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq<
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>
> > https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > > >
> > > > > >>>>>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 3:13 PM Will Jones <
> > > > > >>>>>>>>>>> [email protected]
> > > > > >>>>>>>>>>>>> <mailto:[email protected]>
> > > > > >>>>>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>>>>> Hi Felipe,
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> Thanks for the introduction. I'd be interested
> > to
> > > > > >> hear
> > > > > >>>>>> about
> > > > > >>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>>>>>> applications Velox has found for these
> vectors,
> > > > > >> and in
> > > > > >>>>> what
> > > > > >>>>>>>>>>>>>>>> situations
> > > > > >>>>>>>>>>>>>>>>> they
> > > > > >>>>>>>>>>>>>>>>>>>> are useful. This could be contrasted with the
> > > > > >> current
> > > > > >>>>>>>>>> ListArray
> > > > > >>>>>>>>>>>>>>>>>>>> implementations.
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> IIUC it would be fairly cheap to transform a
> > > > > >> ListArray
> > > > > >>>> to
> > > > > >>>>>> an
> > > > > >>>>>>>>>>>>>>>>> ArrayView, but
> > > > > >>>>>>>>>>>>>>>>>>>> expensive to go the other way.
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> Best,
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> Will Jones
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 3:00 PM Felipe
> Oliveira
> > > > > >>>> Carvalho
> > > > > >>>>> <
> > > > > >>>>>>>>>>>>>>>>>>>> [email protected]<mailto:
> [email protected]
> > >>
> > > > > >>>> wrote:
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>> Hi folks,
> > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>> I would like to start a public discussion on
> > the
> > > > > >>>>> inclusion
> > > > > >>>>>>>>>> of a
> > > > > >>>>>>>>>>>>>>> new
> > > > > >>>>>>>>>>>>>>>>> array
> > > > > >>>>>>>>>>>>>>>>>>>>> format to Arrow — array-view array. The name
> is
> > > > > >> also
> > > > > >>>> up
> > > > > >>>>>> for
> > > > > >>>>>>>>>>>>>>> debate.
> > > > > >>>>>>>>>>>>>>>>>>>>> This format is inspired by Velox's
> ArrayVector
> > > > > >> format
> > > > > >>>>> [1].
> > > > > >>>>>>>>>>>>>>>> Logically,
> > > > > >>>>>>>>>>>>>>>>>>>> this
> > > > > >>>>>>>>>>>>>>>>>>>>> array represents an array of arrays. Each
> > element
> > > > > >> is
> > > > > >>>> an
> > > > > >>>>>>>>>>>>>>> array-view
> > > > > >>>>>>>>>>>>>>>>>>>> (offset
> > > > > >>>>>>>>>>>>>>>>>>>>> and size pair) that points to a range within
> a
> > > > > >> nested
> > > > > >>>>>>>>>> "values"
> > > > > >>>>>>>>>>>>>>>> array
> > > > > >>>>>>>>>>>>>>>>>>>>> (called "elements" in Velox docs). The nested
> > > > > >> array
> > > > > >>>> can
> > > > > >>>>> be
> > > > > >>>>>>> of
> > > > > >>>>>>>>>>> any
> > > > > >>>>>>>>>>>>>>>>> type,
> > > > > >>>>>>>>>>>>>>>>>>>>> which makes this format very flexible and
> > > > > >> powerful.
> > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>> [image: ../_images/array-vector.png]
> > > > > >>>>>>>>>>>>>>>>>>>>> <
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>
> > > https://facebookincubator.github.io/velox/_images/array-vector.png
> > > > > >>>>> <
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>
> > > > https://facebookincubator.github.io/velox/_images/array-vector.png
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>> I'm currently working on a C++ implementation
> > and
> > > > > >> plan
> > > > > >>>>> to
> > > > > >>>>>>>>>> work
> > > > > >>>>>>>>>>>>>>> on a
> > > > > >>>>>>>>>>>>>>>>> Go
> > > > > >>>>>>>>>>>>>>>>>>>>> implementation to fulfill the
> > two-implementations
> > > > > >>>>>>> requirement
> > > > > >>>>>>>>>>> for
> > > > > >>>>>>>>>>>>>>>>> format
> > > > > >>>>>>>>>>>>>>>>>>>>> changes.
> > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>> The draft design:
> > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>> - 3 buffers: [validity_bitmap, int32 offsets
> > > > > >> buffer,
> > > > > >>>>> int32
> > > > > >>>>>>>>>> sizes
> > > > > >>>>>>>>>>>>>>>>> buffer]
> > > > > >>>>>>>>>>>>>>>>>>>>> - 1 child array: "values" as an array of the
> > type
> > > > > >>>>>> parameter
> > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>> validity_bitmap is used to differentiate
> > between
> > > > > >> empty
> > > > > >>>>>> array
> > > > > >>>>>>>>>>>>>>> views
> > > > > >>>>>>>>>>>>>>>>>>>>> (sizes[i] == 0) and NULL array views
> > > > > >>>> (validity_bitmap[i]
> > > > > >>>>>> ==
> > > > > >>>>>>>>>> 0).
> > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>> When the validity_bitmap[i] is 0, both sizes
> > and
> > > > > >>>> offsets
> > > > > >>>>>> are
> > > > > >>>>>>>>>>>>>>>>> undefined
> > > > > >>>>>>>>>>>>>>>>>>>> (as
> > > > > >>>>>>>>>>>>>>>>>>>>> usual), and when sizes[i] == 0, offsets[i] is
> > > > > >>>>> undefined. 0
> > > > > >>>>>>> is
> > > > > >>>>>>>>>>>>>>>>> recommended
> > > > > >>>>>>>>>>>>>>>>>>>>> if setting a value is not an issue to the
> > system
> > > > > >>>>> producing
> > > > > >>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>>> arrays.
> > > > > >>>>>>>>>>>>>>>>>>>>> offsets buffer is not required to be ordered
> > and
> > > > > >> views
> > > > > >>>>>> don't
> > > > > >>>>>>>>>>> have
> > > > > >>>>>>>>>>>>>>>> to
> > > > > >>>>>>>>>>>>>>>>> be
> > > > > >>>>>>>>>>>>>>>>>>>>> disjoint.
> > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>> [1]
> > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> > > > > >>>>>>>>>>>>> <
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>>>>>>>>>>> Felipe O. Carvalho
> > > > > >>>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Reply via email to