Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Felipe Oliveira Carvalho Wed, 14 Jun 2023 17:32:39 -0700

On Wed, Jun 14, 2023 at 3:07 PM Andrew Lamb <[email protected]> wrote:


> Arrow has at least 7 native "official" implementations (Java, Rust, Golang,
> C#, Javascript, Julia and C++), 5 bindings on C++ (C, Ruby, Python, R, and
> Matlab) and likely other implementations (like arrow2 in rust)


Yes, the introduction of every new format has O(number_of_languages)
complexity, but the complexity of the ListView format itself is low
relative to other formats.

We could also specify ListView to require a length-plus-1-long offsets
buffer to make it just like a ListArray with an added sizes buffer. If
offsets are sorted, a cast to ListArray would be trivial -- just share the
offsets buffer. Explaining the format would be easier as well I think.

I would love to discuss these details with the community.


>
I think it is worth remembering that depending on what level of support
> ListView aspires to, such an addition could require non trivial changes to
> many / all of those implementations (and the APIs they expose).
>
> Andrew
>
> On Wed, Jun 14, 2023 at 12:53 PM Felipe Oliveira Carvalho <
> [email protected]> wrote:
>
> > General approach to alternative formats aside, in the specific case of
> > ListView, I think the implementation complexity is being overestimated in
> > these discussions.
> >
> > The C++ Arrow implementation shares a lot of code between List and
> > LargeList. And with some tweaks, I'm able to share that common
> > infrastructure for ListView as well. [1]
> >
> > ListView is similar to list: it doesn't require offsets to be sorted and
> > adds an extra buffer containing sizes. For symmetry with the List and
> > LargeList types (FixedSizeList not included), I'm going to propose we
> add a
> > LargeListView. That is not part of the draft implementation yet, but
> seems
> > like an obvious thing to have now that I implemented the `if_else`
> > specialization. [2] David Li asked about this above and I can confirm now
> > that 64-bit version of ListView (LargeListView) is in the plans.
> >
> > Trying to avoid re-implementing some kernels is not a good goal to chase,
> > IMO, because kernels need tweaks to take advantage of the format.
> >
> > [1] https://github.com/apache/arrow/pull/35345
> > [2] https://github.com/felipecrv/arrow/commits/list_view_backup
> > --
> > Felipe
> >
> > On Wed, Jun 14, 2023 at 12:08 PM Weston Pace <[email protected]>
> > wrote:
> >
> > > > perhaps we could support this use-case as
> > > > a canonical extension type over dictionary encoded, variable-sized
> > > > arrays
> > >
> > > I believe this suggestion is valid and could be used to solve the
> if-else
> > > case.  The algorithm, if I understand it, would be roughly:
> > >
> > > ```
> > > // Note: Simple pseudocode, vectorization left as exercise for the
> reader
> > > auto indices_builder = ...
> > > auto list_builder = ...
> > > indices_builder.resize(batch.length);
> > > Array condition_mask = condition.EvaluateBatch(batch);
> > > for row_index in selected_rows(condition_mask):
> > >   indices_builder[row_index] = list_builder.CurrentLength();
> > >   list_builder.Append(if_expr.EvaluateRow(batch, row_index))
> > > for row_index in unselected_rows(condition_mask):
> > >   indices_builder[row_index] = list_builder.CurrentLength();
> > >   list_builder.Append(else_expr.EvaluateRow(batch, row_index))
> > > return DictionaryArray(indices_builder.Finish(), list_builder.Finish())
> > > ```
> > >
> > > I also agree this is a slightly awkward use of dictionaries (e.g. the
> > > dictionary would have the same size as the # of indices) and perhaps
> not
> > > the most intuitive way to solve the problem.
> > >
> > > My gut reaction is simply "an improved if/else kernel is not, alone,
> > enough
> > > justification for a new layout" and yet...
> > >
> > > I think we are seeing the start (I hope) of a trend where Arrow is not
> > just
> > > used "between systems" (e.g. to shuttle data from one place to another,
> > or
> > > between a query engine and a visualization tool) but also "within
> > systems"
> > > (e.g. UDFs, bespoke file formats and temporary tables, between workers
> > in a
> > > distributed query engine).  When arrow is used "within systems" I think
> > > both the number of bespoke formats and the significance of conversion
> > cost
> > > increases.  For example, it's easy to say that Velox should convert at
> > the
> > > boundary as data leaves Velox.  But what if Velox (or datafusion or
> ...)
> > > were to define an interface for UDFs.  Would we want to use Arrow there
> > > (e.g. the C data interface is a good fit)?  If so, wouldn't the
> > conversion
> > > cost be more significant?
> > >
> > > > Also, I'm very lukewarm towards the concept of "alternative layouts"
> > > > suggested somewhere else in this thread. It does not seem a good
> choice
> > > > to complexify the Arrow format that much.
> > >
> > > I think, in my opinion, this depends on how many of these alternative
> > > layouts exist.  If there are just a few, then I agree, we should just
> > adopt
> > > them as formal first-class layouts.  If there are many, then I think it
> > > will be too much complexity in Arrow to have all the different choices.
> > > Or, we could say there are many, but the alternatives don't belong in
> > Arrow
> > > at all.  In that case I think it's the same question as the above
> > > paragraph, "do we want Arrow to be used within systems?  Or just
> between
> > > systems?"
> > >
> > >
> > > On Wed, Jun 14, 2023 at 2:07 AM Antoine Pitrou <[email protected]>
> > wrote:
> > >
> > > >
> > > > I agree that ListView cannot be an extension type, given that it
> > > > features a new layout, and therefore cannot reasonably be backed by
> an
> > > > existing storage type (AFAICT).
> > > >
> > > > Also, I'm very lukewarm towards the concept of "alternative layouts"
> > > > suggested somewhere else in this thread. It does not seem a good
> choice
> > > > to complexify the Arrow format that much.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > Le 07/06/2023 à 00:21, Felipe Oliveira Carvalho a écrit :
> > > > > +1 on what Ian said.
> > > > >
> > > > > And as I write kernels for this new format, I’m learning that it’s
> > > > possible
> > > > > to re-use the common infrastructure used by List and LargeList to
> > > > implement
> > > > > the ListView related features with some adjustments.
> > > > >
> > > > > IMO having this format as a second-class citizen would more likely
> > > > > complicate things because it would make this unification harder.
> > > > >
> > > > > —
> > > > > Felipe
> > > > >
> > > > > On Tue, 6 Jun 2023 at 18:45 Ian Cook <[email protected]> wrote:
> > > > >
> > > > >> To clarify why we cannot simply propose adding ListView as a new
> > > > >> “canonical extension type”: The extension type mechanism in Arrow
> > > > >> depends on the underlying data being organized in an existing
> Arrow
> > > > >> layout—that way an implementation that does not support the
> > extension
> > > > >> type can still handle the underlying data. But ListView is a
> wholly
> > > > >> new layout.
> > > > >>
> > > > >> I strongly agree with Weston’s idea that it is a good time for
> Arrow
> > > > >> to introduce the notion of “canonical alternative layouts.”
> > > > >>
> > > > >> Taken together, I think that canonical extension types and
> canonical
> > > > >> alternative layouts could serve as an “incubator” for proposed new
> > > > >> representations. For example, if a proposed canonical alternative
> > > > >> layout ends up being broadly adopted, then that will serve as a
> > signal
> > > > >> that we should consider adding it as a primary layout in the core
> > > > >> spec.
> > > > >>
> > > > >> It seems to me that most projects that are implementing Arrow
> today
> > > > >> are not aiming to provide complete coverage of Arrow; rather they
> > are
> > > > >> adopting Arrow because of its role as a standard and they are
> > > > >> implementing only as much of the Arrow standard as they require to
> > > > >> achieve some goal. I believe that such projects are important
> Arrow
> > > > >> stakeholders, and I believe that this proposed notion of canonical
> > > > >> alternative layouts will serve them well and will create
> > efficiencies
> > > > >> by standardizing implementations around a shared set of
> > alternatives.
> > > > >>
> > > > >> However I think that the documentation for canonical alternative
> > > > >> layouts should strongly encourage implementers to default to using
> > the
> > > > >> primary layouts defined in the core spec and only use alternative
> > > > >> layouts in cases where the primary layouts do not meet their
> needs.
> > > > >>
> > > > >>
> > > > >> On Sat, May 27, 2023 at 7:44 PM Micah Kornfield <
> > > [email protected]>
> > > > >> wrote:
> > > > >>>
> > > > >>> This sounds reasonable to me but my main concern is, I'm not sure
> > > there
> > > > >> is
> > > > >>> a great mechanism to enforce canonical layouts don't somehow
> become
> > > > >> default
> > > > >>> (or the only implementation).
> > > > >>>
> > > > >>> Even for these new layouts, I think it might be worth rethinking
> > > > binding
> > > > >> a
> > > > >>> layout into the schema versus having a different concept of
> > encoding
> > > > (and
> > > > >>> changing some of the corresponding data structures).
> > > > >>>
> > > > >>>
> > > > >>> On Mon, May 22, 2023 at 10:37 AM Weston Pace <
> > [email protected]>
> > > > >> wrote:
> > > > >>>
> > > > >>>> Trying to settle on one option is a fruitless endeavor.  Each
> type
> > > has
> > > > >> pros
> > > > >>>> and cons.  I would also predict that the largest existing usage
> of
> > > > >> Arrow is
> > > > >>>> shuttling data from one system to another.  The newly proposed
> > > format
> > > > >>>> doesn't appear to have any significant advantage for that use
> case
> > > (if
> > > > >>>> anything, the existing format is arguably better as it is more
> > > > >> compact).
> > > > >>>>
> > > > >>>> I am very biased towards historical precedent and avoiding
> > breaking
> > > > >>>> changes.
> > > > >>>>
> > > > >>>> We have "canonical extension types", perhaps it is time for
> > > "canonical
> > > > >>>> alternative layouts".  We could define it as such:
> > > > >>>>
> > > > >>>>   * There are one or more primary layouts
> > > > >>>>     * Existing layouts are automatically considered primary
> > layouts,
> > > > >> even if
> > > > >>>> they wouldn't
> > > > >>>>       have been primary layouts initially (e.g. large list)
> > > > >>>>   * A new layout, if it is semantically equivalent to another,
> is
> > > > >> considered
> > > > >>>> an alternative layout
> > > > >>>>   * An alternative layout still has the same requirements for
> > > adoption
> > > > >> (two
> > > > >>>> implementations and a vote)
> > > > >>>>     * An implementation should not feel pressured to rush and
> > > > implement
> > > > >> the
> > > > >>>> new layout.
> > > > >>>>       It would be good if they contribute in the discussion and
> > > > >> consider the
> > > > >>>> layout and vote
> > > > >>>>       if they feel it would be an acceptable design.
> > > > >>>>   * We can define and vote and approve as many canonical
> > alternative
> > > > >> layouts
> > > > >>>> as we want:
> > > > >>>>     * A canonical alternative layout should, at a minimum, have
> > some
> > > > >>>>       reasonable justification, such as improved performance for
> > > > >> algorithm X
> > > > >>>>   * Arrow implementations MUST support the primary layouts
> > > > >>>>   * An Arrow implementation MAY support a canonical alternative,
> > > > >> however:
> > > > >>>>     * An Arrow implementation MUST first support the primary
> > layout
> > > > >>>>     * An Arrow implementation MUST support conversion to/from
> the
> > > > >> primary
> > > > >>>> and canonical layout
> > > > >>>>     * An Arrow implementation's APIs MUST only provide data in
> the
> > > > >>>> alternative
> > > > >>>>       layout if it is explicitly asked for (e.g. schema
> inference
> > > > should
> > > > >>>> prefer the primary layout).
> > > > >>>>   * We can still vote for new primary layouts (e.g. promoting a
> > > > >> canonical
> > > > >>>> alternative) but, in these
> > > > >>>>      votes we don't only consider the value (e.g. performance)
> of
> > > the
> > > > >> layout
> > > > >>>> but also the interoperability.
> > > > >>>>      In other words, a layout can only become a primary layout
> if
> > > > there
> > > > >> is
> > > > >>>> significant evidence that most
> > > > >>>>      implementations plan to adopt it.
> > > > >>>>
> > > > >>>> This lets us evolve support for new layouts more naturally.  We
> > can
> > > > >>>> generally assume that users will not, initially, be aware of
> these
> > > > >>>> alternative layouts.  However, everything will just work.  They
> > may
> > > > >> start
> > > > >>>> to see a performance penalty stemming from a lack of support for
> > > these
> > > > >>>> layouts.  If this performance penalty becomes significant then
> > they
> > > > >> will
> > > > >>>> discover it and become aware of the problem.  They can then ask
> > > > >> whatever
> > > > >>>> library they are using to add support for the alternative
> layout.
> > > As
> > > > >>>> enough users find a need for it then libraries will add support.
> > > > >>>> Eventually, enough libraries will support it that we can adopt
> it
> > > as a
> > > > >>>> primary layout.
> > > > >>>>
> > > > >>>> Also, it allows libraries to adopt alternative layouts more
> > > > >> aggressively if
> > > > >>>> they would like while still hopefully ensuring that we
> eventually
> > > all
> > > > >>>> converge on the same implementation of the alternative layout.
> > > > >>>>
> > > > >>>> On Mon, May 22, 2023 at 9:35 AM Will Jones <
> > [email protected]
> > > >
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>>> Hello Arrow devs,
> > > > >>>>>
> > > > >>>>> I don't understand why we would start deprecating features in
> the
> > > > >> Arrow
> > > > >>>>>> format. Even starting this talk might already be a bad idea
> > > > >> PR-wise.
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>> I agree we don't want to make breaking changes to the Arrow
> > format.
> > > > >> But
> > > > >>>>> several maintainers have already stated they have no interest
> in
> > > > >>>>> maintaining both list types with full compute functionality
> > [1][2],
> > > > >> so I
> > > > >>>>> think it's very likely one list type or the other will be
> > > > >>>>> implicitly preferred in the ecosystem if this data type was
> > added.
> > > > >> If
> > > > >>>>> that's the case, I'd prefer that we agreed as a community which
> > one
> > > > >>>> should
> > > > >>>>> be preferred. Maybe that's not the best path; it's just one way
> > for
> > > > >> us to
> > > > >>>>> balance stability, maintenance burden, and relevance.
> > > > >>>>>
> > > > >>>>> Can someone help distill down the primary rationale and usecase
> > for
> > > > >>>>>> adding ArrayView to the Arrow Spec?
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>> Looking back at that old thread, I think one of the main
> > > motivations
> > > > >> is
> > > > >>>> to
> > > > >>>>> try to prevent query engine implementers from feeling there is
> a
> > > > >> tradeoff
> > > > >>>>> between having state-of-the-art performance and being
> > Arrow-native.
> > > > >> For
> > > > >>>>> some of the new array types, we had both Velox and DuckDB use
> > them,
> > > > >> so it
> > > > >>>>> was reasonable to expect they were innovations that might
> > > > >> proliferate.
> > > > >>>> I'm
> > > > >>>>> not sure if the ArrayView is part of that. From Wes earlier
> [3]:
> > > > >>>>>
> > > > >>>>> The idea is that in a world of data and query federation (for
> > > > >> example,
> > > > >>>>>> consider [1] where Arrow is being used as a data federation
> > layer
> > > > >> with
> > > > >>>>> many
> > > > >>>>>> query engines), we want to increase the amount of data
> in-flight
> > > > >> and
> > > > >>>>>> in-memory that is in Arrow format. So if query engines are
> > having
> > > > >> to
> > > > >>>>> depart
> > > > >>>>>> substantially from the Arrow format to get performance, then
> > this
> > > > >>>>> creates a
> > > > >>>>>> potential lose-lose situation: * Depart from Arrow: get better
> > > > >>>>> performance
> > > > >>>>>> but pay serialization costs to read and write Arrow (the
> > > > >> performance
> > > > >>>> and
> > > > >>>>>> resource utilization benefits outweigh the serialization
> costs).
> > > > >> This
> > > > >>>>> puts
> > > > >>>>>> additional pressure on query engines to build specialized
> > > > >> components
> > > > >>>> for
> > > > >>>>>> solving problems rather than making use of off-the-shelf
> > > components
> > > > >>>> that
> > > > >>>>>> use Arrow. This has knock-on effects on ecosystem
> > fragmentation. *
> > > > >> Or
> > > > >>>> use
> > > > >>>>>> Arrow, and accept suboptimal query processing performance
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> Will mentions one usecase is Velox consuming python UDF output,
> > > which
> > > > >>>> seems
> > > > >>>>>> to be mostly about how fast Velox can consume this format, not
> > how
> > > > >> fast
> > > > >>>>> it
> > > > >>>>>> can be written. Are there other usecases?
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>> To be clear, I don't know if that's the use case they want.
> > That's
> > > > >> just
> > > > >>>> me
> > > > >>>>> speculating.
> > > > >>>>>
> > > > >>>>> I still have some questions myself:
> > > > >>>>>
> > > > >>>>> 1. Is this array type currently only used in Velox? (not DuckDB
> > > like
> > > > >> some
> > > > >>>>> of the other new types?) What evidence do we have that it will
> > > become
> > > > >>>> used
> > > > >>>>> outside of Velox?
> > > > >>>>> 2. We already have three list types: list, large list (64-bit
> > > > >> offsets),
> > > > >>>> and
> > > > >>>>> fixed size list. Do we think we will only want a view version
> of
> > > the
> > > > >>>> 32-bit
> > > > >>>>> offset variable length list? Or are we potentially talking
> about
> > > view
> > > > >>>>> variants for all three?
> > > > >>>>>
> > > > >>>>> Best,
> > > > >>>>>
> > > > >>>>> Will Jones
> > > > >>>>>
> > > > >>>>> [1]
> > > https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx
> > > > >>>>> [2]
> > > https://lists.apache.org/thread/cc4w3vs3foj1fmpq9x888k51so60ftr3
> > > > >>>>> [3]
> > > https://lists.apache.org/thread/mk2yn62y6l8qtngcs1vg2qtwlxzbrt8t
> > > > >>>>>
> > > > >>>>> On Mon, May 22, 2023 at 3:48 AM Andrew Lamb <
> > [email protected]>
> > > > >>>> wrote:
> > > > >>>>>
> > > > >>>>>> Can someone help distill down the primary rationale and
> usecase
> > > for
> > > > >>>>>> adding ArrayView to the Arrow Spec?
> > > > >>>>>>
> > > > >>>>>>  From the above discussions, the stated rationale seems to be
> > fast
> > > > >>>>>> (zero-copy) interchange with Velox.
> > > > >>>>>>
> > > > >>>>>> This thread has qualitatively enumerated the benefits of
> > > > >> (offset+len)
> > > > >>>>>> encoding over the existing Arrow ListArray (offets) approach,
> > but
> > > I
> > > > >>>>> haven't
> > > > >>>>>> seen any performance measurements that might help us to gauge
> > the
> > > > >>>>> tradeoff
> > > > >>>>>> in additional complexity vs runtime overhead.
> > > > >>>>>>
> > > > >>>>>> Will mentions one usecase is Velox consuming python UDF
> output,
> > > > >> which
> > > > >>>>> seems
> > > > >>>>>> to be mostly about how fast Velox can consume this format, not
> > how
> > > > >> fast
> > > > >>>>> it
> > > > >>>>>> can be written. Are there other usecases?
> > > > >>>>>>
> > > > >>>>>> Do we have numbers showing how much overhead converting to
> /from
> > > > >>>> Velox's
> > > > >>>>>> internal representation and the existing ListArray adds? Has
> > > > >> anyone in
> > > > >>>>>> Velox land considered adding faster support for Arrow style
> > > > >> ListArray
> > > > >>>>>> encoding?
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> Andrew
> > > > >>>>>>
> > > > >>>>>> On Mon, May 22, 2023 at 4:38 AM Antoine Pitrou <
> > > [email protected]
> > > > >>>
> > > > >>>>> wrote:
> > > > >>>>>>
> > > > >>>>>>>
> > > > >>>>>>> Hi,
> > > > >>>>>>>
> > > > >>>>>>> I don't understand why we would start deprecating features in
> > the
> > > > >>>> Arrow
> > > > >>>>>>> format. Even starting this talk might already be a bad idea
> > > > >> PR-wise.
> > > > >>>>>>>
> > > > >>>>>>> As for implementing conversions at the I/O boundary, it's a
> > > > >>>> reasonably
> > > > >>>>>>> policy, but it still requires work by implementors and it's
> not
> > > > >>>> granted
> > > > >>>>>>> that all consumers of the Arrow format will grow such
> > conversions
> > > > >>>>>>> if/when we add non-trivial types such as ListView or
> > StringView.
> > > > >>>>>>>
> > > > >>>>>>> Regards
> > > > >>>>>>>
> > > > >>>>>>> Antoine.
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> Le 22/05/2023 à 00:39, Will Jones a écrit :
> > > > >>>>>>>> One more thing: Looking back on the previous discussion[1]
> > > > >> (which
> > > > >>>>>> Weston
> > > > >>>>>>>> pointed out in his earlier message), Jorge suggested that
> the
> > > > >> old
> > > > >>>>> list
> > > > >>>>>>>> types might be deprecated in favor of view variants [2].
> > Others
> > > > >>>> were
> > > > >>>>>>>> worried that it might undermine the perception that the
> Arrow
> > > > >>>> format
> > > > >>>>> is
> > > > >>>>>>>> stable. I think it might be worth thinking about "soft
> > > > >> deprecating"
> > > > >>>>> the
> > > > >>>>>>> old
> > > > >>>>>>>> list type: suggesting new implementations prefer the list
> > > > >> view, but
> > > > >>>>>>>> reassuring that implementations should support the old
> format,
> > > > >> even
> > > > >>>>> if
> > > > >>>>>>> they
> > > > >>>>>>>> just convert to the new format. To be clear, this wouldn't
> > > > >> mean we
> > > > >>>>> plan
> > > > >>>>>>> to
> > > > >>>>>>>> create breaking changes in the format; but if we ever did
> for
> > > > >> other
> > > > >>>>>>>> reasons, the old list type might go.
> > > > >>>>>>>>
> > > > >>>>>>>> Arrow compute libraries could choose either format for
> compute
> > > > >>>>> support,
> > > > >>>>>>> and
> > > > >>>>>>>> plan to do conversion at the boundaries. Libraries that use
> > > > >> the new
> > > > >>>>>> type
> > > > >>>>>>>> will have cheap conversion when reading the old type.
> > Meanwhile
> > > > >>>> those
> > > > >>>>>>> that
> > > > >>>>>>>> are still on the old type will have some incentive to move
> > > > >> towards
> > > > >>>>> the
> > > > >>>>>>> new
> > > > >>>>>>>> one, since that conversion will not be as efficient.
> > > > >>>>>>>>
> > > > >>>>>>>> [1]
> > > > >>>>
> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > > > >>>>>>>> [2]
> > > > >>>>
> https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx
> > > > >>>>>>>>
> > > > >>>>>>>> On Sun, May 21, 2023 at 3:07 PM Will Jones <
> > > > >>>> [email protected]>
> > > > >>>>>>> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> Hello,
> > > > >>>>>>>>>
> > > > >>>>>>>>> I think Sasha brings up a good point, that the advantages
> of
> > > > >> this
> > > > >>>>>> format
> > > > >>>>>>>>> seem to be primarily about query processing. Other
> encodings
> > > > >> like
> > > > >>>>> REE
> > > > >>>>>>> and
> > > > >>>>>>>>> dictionary have space-saving advantages that justify them
> > > > >> simply
> > > > >>>> in
> > > > >>>>>>> terms
> > > > >>>>>>>>> of space efficiency (although they have query processing
> > > > >>>> advantages
> > > > >>>>> as
> > > > >>>>>>>>> well). As discussed, most use cases are already well served
> > by
> > > > >>>>>> existing
> > > > >>>>>>>>> list types and dictionary encoding.
> > > > >>>>>>>>>
> > > > >>>>>>>>> I agree that there are cases where transferring this type
> > > > >> without
> > > > >>>>>>>>> conversion would be ideal. One use case I can think of is
> if
> > > > >> Velox
> > > > >>>>>>> wants to
> > > > >>>>>>>>> be able to take Arrow-based UDFs (possibly written with
> > > > >> PyArrow,
> > > > >>>> for
> > > > >>>>>>>>> example) that operate on this column type and therefore
> wants
> > > > >>>>>> zero-copy
> > > > >>>>>>>>> exchange over the C Data Interface.
> > > > >>>>>>>>>
> > > > >>>>>>>>> One big question I have: we already have three list types:
> > > > >> list,
> > > > >>>>> large
> > > > >>>>>>>>> list (64-bit offsets), and fixed size list. Do we think we
> > > > >> will
> > > > >>>> only
> > > > >>>>>>> want a
> > > > >>>>>>>>> view version of the 32-bit offset variable length list? Or
> > > > >> are we
> > > > >>>>>>>>> potentially talking about view variants for all three?
> > > > >>>>>>>>>
> > > > >>>>>>>>> Best,
> > > > >>>>>>>>>
> > > > >>>>>>>>> Will Jones
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> On Sun, May 21, 2023 at 2:19 PM Felipe Oliveira Carvalho <
> > > > >>>>>>>>> [email protected]> wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>>> The benefit of having a memory format that’s friendly to
> > > > >>>>>>> non-deterministic
> > > > >>>>>>>>>> order writes is unlocked by the transport and processing
> of
> > > > >> the
> > > > >>>>> data
> > > > >>>>>>> being
> > > > >>>>>>>>>> agnostic to the physical order as much as possible.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Requiring a conversion could cancel out that benefit. But
> it
> > > > >> can
> > > > >>>>> be a
> > > > >>>>>>>>>> provisory step for compatibility between systems that
> don’t
> > > > >>>>>> understand
> > > > >>>>>>> the
> > > > >>>>>>>>>> format yet. This is similar to the situation with
> > compression
> > > > >>>>> schemes
> > > > >>>>>>> like
> > > > >>>>>>>>>> run-end encoding — the goal is processing the compressed
> > data
> > > > >>>>>> directly
> > > > >>>>>>>>>> without an expansion step whenever possible.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> This is why having it as part of the open Arrow format is
> so
> > > > >>>>>> important:
> > > > >>>>>>>>>> everyone can agree on a format that’s friendly to parallel
> > > > >> and/or
> > > > >>>>>>>>>> vectorized compute kernels without introducing multiple
> > > > >>>>> incompatible
> > > > >>>>>>>>>> formats to the ecosystem and without imposing a conversion
> > > > >> step
> > > > >>>>>> between
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>> different systems.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> —
> > > > >>>>>>>>>> Felipe
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Sat, 20 May 2023 at 20:04 Aldrin
> > > > >> <[email protected]>
> > > > >>>>>>> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> I don't feel like this representation is necessarily a
> > > > >> detail of
> > > > >>>>> the
> > > > >>>>>>>>>> query
> > > > >>>>>>>>>>> engine, but I am also not sure why this representation
> > would
> > > > >>>> have
> > > > >>>>> to
> > > > >>>>>>> be
> > > > >>>>>>>>>>> converted to a non-view format when serializing. Could
> you
> > > > >>>> clarify
> > > > >>>>>>>>>> that? My
> > > > >>>>>>>>>>> impression is that this representation could be used for
> > > > >>>>> persistence
> > > > >>>>>>> or
> > > > >>>>>>>>>>> data transfer, though it can be more complex to guarantee
> > > > >> the
> > > > >>>>>> portion
> > > > >>>>>>> of
> > > > >>>>>>>>>>> the buffer that an index points to is also present in
> > > > >> memory.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Sent from Proton Mail for iOS
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <
> > > > >>>>>>>>>> [email protected]
> > > > >>>>>>>>>>>
> > > > >> <On+Sat,+May+20,+2023+at+15:00,+Sasha+Krassovsky+%3C%3Ca+href=>>
> > > > >>>>>>> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Hi everyone,
> > > > >>>>>>>>>>> I understand that there are numerous benefits to this
> > > > >>>>> representation
> > > > >>>>>>>>>>> during query processing, but would it be fair to say that
> > > > >> this
> > > > >>>> is
> > > > >>>>> an
> > > > >>>>>>>>>>> implementation detail of the query engine? Query engines
> > > > >> don’t
> > > > >>>>>>>>>> necessarily
> > > > >>>>>>>>>>> need to conform to the Arrow format internally, only at
> > > > >>>>>> ingest/egress
> > > > >>>>>>>>>>> points, and performing a conversion from the non-view to
> > > > >> view
> > > > >>>>> format
> > > > >>>>>>>>>> seems
> > > > >>>>>>>>>>> like it would be very cheap (though I understand not
> > > > >> necessarily
> > > > >>>>> the
> > > > >>>>>>>>>> other
> > > > >>>>>>>>>>> way around, but you’d need to do that anyway if you’re
> > > > >>>>> serializing).
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Sasha Krassovsky
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> 20 мая 2023 г., в 13:00, Will Jones <
> > > > >> [email protected]>
> > > > >>>>>>>>>>> написал(а):
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Thanks for sharing these details, Pedro. The
> conditional
> > > > >>>>> branches
> > > > >>>>>>>>>>> argument
> > > > >>>>>>>>>>>> makes a lot of sense to me.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> The tensors point brings up some interesting issues. For
> > > > >> now,
> > > > >>>>> we've
> > > > >>>>>>>>>>> defined
> > > > >>>>>>>>>>>> our only tensor extension type to be built on a fixed
> size
> > > > >>>> list.
> > > > >>>>>> If a
> > > > >>>>>>>>>> use
> > > > >>>>>>>>>>>> case of this might be manipulating tensors with zero
> copy,
> > > > >>>>> perhaps
> > > > >>>>>>>>>> that
> > > > >>>>>>>>>>>> suggests that we want a fixed size list variant? In
> > > > >> addition,
> > > > >>>>> would
> > > > >>>>>>> we
> > > > >>>>>>>>>>> have
> > > > >>>>>>>>>>>> to define another extension type to be a ListView
> variant?
> > > > >> Or
> > > > >>>>> would
> > > > >>>>>>> we
> > > > >>>>>>>>>>> want
> > > > >>>>>>>>>>>> to think about making extension types somehow valid
> across
> > > > >>>>> various
> > > > >>>>>>>>>>>> encodings of the same "logical type"?
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha
> > > > >> Pedreira
> > > > >>>>>>>>>>>>> <[email protected]> wrote:
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Hi all,
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> This is Pedro from the Velox team at Meta. This is my
> > > > >> first
> > > > >>>> time
> > > > >>>>>>>>>> here,
> > > > >>>>>>>>>>> so
> > > > >>>>>>>>>>>>> nice to e-meet you!
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Adding to what Felipe said, the main reason we created
> > > > >>>>> “ListView”
> > > > >>>>>>>>>>> (though
> > > > >>>>>>>>>>>>> we just call them ArrayVector/MapVector in Velox) is
> > that,
> > > > >>>> along
> > > > >>>>>>> with
> > > > >>>>>>>>>>>>> StringViews for strings, they allow us to write any
> > > > >> columnar
> > > > >>>>>> buffer
> > > > >>>>>>>>>>>>> out-or-order, regardless of their types or encodings.
> > > > >> This is
> > > > >>>>>>>>>> naturally
> > > > >>>>>>>>>>>>> doable for all primitive types (fixed-size), but not
> for
> > > > >> types
> > > > >>>>>> that
> > > > >>>>>>>>>>> don’t
> > > > >>>>>>>>>>>>> have fixed size and are required to be contiguous. The
> > > > >>>>> StringView
> > > > >>>>>>> and
> > > > >>>>>>>>>>>>> ListView formats allow us to keep this invariant in
> > Velox.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Being able to write vectors out-of-order is useful when
> > > > >>>>> executing
> > > > >>>>>>>>>>>>> conditionals like IF/SWITCH statements, which are
> > > > >> pervasive
> > > > >>>>> among
> > > > >>>>>>> our
> > > > >>>>>>>>>>>>> workloads. To fully vectorize it, one first evaluates
> the
> > > > >>>>>>> expression,
> > > > >>>>>>>>>>> then
> > > > >>>>>>>>>>>>> generate a bitmap containing which rows take the THEN
> and
> > > > >>>> which
> > > > >>>>>> take
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>>>>> ELSE branch. Then you populate all rows that match the
> > > > >> first
> > > > >>>>>> branch
> > > > >>>>>>>>>> by
> > > > >>>>>>>>>>>>> evaluating the THEN expression in a vectorized
> > > > >> (branch-less
> > > > >>>> and
> > > > >>>>>>> cache
> > > > >>>>>>>>>>>>> friendly) way, and subsequently the ELSE branch. If you
> > > > >> can’t
> > > > >>>>>> write
> > > > >>>>>>>>>> them
> > > > >>>>>>>>>>>>> out-of-order, you would either have a big branch per
> row
> > > > >>>>>> dispatching
> > > > >>>>>>>>>> to
> > > > >>>>>>>>>>> the
> > > > >>>>>>>>>>>>> right expression (slow), or populate two distinct
> vectors
> > > > >> then
> > > > >>>>>>>>>> merging
> > > > >>>>>>>>>>> them
> > > > >>>>>>>>>>>>> at the end (potentially even slower). How much faster
> our
> > > > >>>>> approach
> > > > >>>>>>> is
> > > > >>>>>>>>>>>>> highly depends on the buffer sizes and expressions, but
> > we
> > > > >>>> found
> > > > >>>>>> it
> > > > >>>>>>>>>> to
> > > > >>>>>>>>>>> be
> > > > >>>>>>>>>>>>> faster enough on average to justify us extending the
> > > > >>>> underlying
> > > > >>>>>>>>>> layout.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> With that said, this is all within a single thread of
> > > > >>>> execution.
> > > > >>>>>>>>>>>>> Parallelization is done by feeding each thread its own
> > > > >>>>>> vector/data.
> > > > >>>>>>>>>> As
> > > > >>>>>>>>>>>>> pointed out in a previous message, this also gives you
> > the
> > > > >>>>>>>>>> flexibility
> > > > >>>>>>>>>>> to
> > > > >>>>>>>>>>>>> implement cardinality increasing/reducing operations,
> but
> > > > >> we
> > > > >>>>> don’t
> > > > >>>>>>>>>> use
> > > > >>>>>>>>>>> it
> > > > >>>>>>>>>>>>> for that purpose. Operations like filtering, joining,
> > > > >>>> unnesting
> > > > >>>>>> and
> > > > >>>>>>>>>>> similar
> > > > >>>>>>>>>>>>> are done by wrapping the internal vector in a
> dictionary,
> > > > >> as
> > > > >>>>> these
> > > > >>>>>>>>>> need
> > > > >>>>>>>>>>> to
> > > > >>>>>>>>>>>>> work not only on “ListViews” but on any data types with
> > > > >> any
> > > > >>>>>>> encoding.
> > > > >>>>>>>>>>> There
> > > > >>>>>>>>>>>>> are more details on Section 4.2.1 in [1]
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Beyond this, it also gives function/kernel developers
> > more
> > > > >>>>>>>>>> flexibility
> > > > >>>>>>>>>>> to
> > > > >>>>>>>>>>>>> implement operations that manipulate Arrays/Maps. For
> > > > >> example,
> > > > >>>>>>>>>>> operations
> > > > >>>>>>>>>>>>> that slice these containers can be implemented in a
> > > > >> zero-copy
> > > > >>>>>> manner
> > > > >>>>>>>>>> by
> > > > >>>>>>>>>>>>> just rearranging the lengths/offsets indices, without
> > ever
> > > > >>>>>> touching
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>>>>> larger internal buffers. This is a similar motivation
> as
> > > > >> for
> > > > >>>>>>>>>> StringView
> > > > >>>>>>>>>>>>> (think of substr(), trim(), and similar). One nice last
> > > > >>>> property
> > > > >>>>>> is
> > > > >>>>>>>>>> that
> > > > >>>>>>>>>>>>> this layout allows for overlapping ranges. This is
> > > > >> something
> > > > >>>>>>>>>> discussed
> > > > >>>>>>>>>>> with
> > > > >>>>>>>>>>>>> our ML people to allow deduping feature values in a
> > tensor
> > > > >>>>> (which
> > > > >>>>>> is
> > > > >>>>>>>>>>> fairly
> > > > >>>>>>>>>>>>> common), but not something we have leveraged just yet.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> [1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Best,
> > > > >>>>>>>>>>>>> --
> > > > >>>>>>>>>>>>> Pedro Pedreira
> > > > >>>>>>>>>>>>> ________________________________
> > > > >>>>>>>>>>>>> From: Felipe Oliveira Carvalho <[email protected]>
> > > > >>>>>>>>>>>>> Sent: Friday, May 19, 2023 10:01 AM
> > > > >>>>>>>>>>>>> To: [email protected] <[email protected]>
> > > > >>>>>>>>>>>>> Cc: Pedro Eugenio Rocha Pedreira <[email protected]>
> > > > >>>>>>>>>>>>> Subject: Re: [DISCUSS][Format] Starting the draft
> > > > >>>> implementation
> > > > >>>>>> of
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>>>>> ArrayView array format
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> +pedroerp On Thu, 11 May 2023 at 17: 51 Raphael
> > > > >> Taylor-Davies
> > > > >>>>> <r.
> > > > >>>>>>>>>>>>> taylordavies@ googlemail. com. invalid> wrote: Hi
> All, >
> > > > >> if
> > > > >>>> we
> > > > >>>>>>> added
> > > > >>>>>>>>>>>>> this, do we think many Arrow and query > engine
> > > > >>>> implementations
> > > > >>>>>> (for
> > > > >>>>>>>>>>>>> example, DataFusion) will be
> > > > >>>>>>>>>>>>> ZjQcmQRYFpfptBannerStart
> > > > >>>>>>>>>>>>> This Message Is From an External Sender
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> ZjQcmQRYFpfptBannerEnd
> > > > >>>>>>>>>>>>> +pedroerp
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies
> > > > >>>>>>>>>>>>> <[email protected]> wrote:
> > > > >>>>>>>>>>>>> Hi All,
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> if we added this, do we think many Arrow and query
> > > > >>>>>>>>>>>>>> engine implementations (for example, DataFusion) will
> be
> > > > >>>> eager
> > > > >>>>> to
> > > > >>>>>>>>>> add
> > > > >>>>>>>>>>>>> full
> > > > >>>>>>>>>>>>>> support for the type, including compute kernels? Or
> are
> > > > >> they
> > > > >>>>>> likely
> > > > >>>>>>>>>> to
> > > > >>>>>>>>>>>>> just
> > > > >>>>>>>>>>>>>> convert this type to ListArray at import boundaries?
> > > > >>>>>>>>>>>>> I can't speak for query engines in general, but at
> least
> > > > >> for
> > > > >>>>>>> arrow-rs
> > > > >>>>>>>>>>>>> and by extension DataFusion, and based on my current
> > > > >>>>> understanding
> > > > >>>>>>> of
> > > > >>>>>>>>>>>>> the use-cases I would be rather hesitant to add support
> > > > >> to the
> > > > >>>>>>>>>> kernels
> > > > >>>>>>>>>>>>> for this array type, definitely instead favouring
> > > > >> conversion
> > > > >>>> at
> > > > >>>>>> the
> > > > >>>>>>>>>>>>> edges. We already have issues with the amount of code
> > > > >>>> generation
> > > > >>>>>>>>>>>>> resulting in binary bloat and long compile times, and I
> > > > >> worry
> > > > >>>>> this
> > > > >>>>>>>>>> would
> > > > >>>>>>>>>>>>> worsen this situation whilst not really providing
> > > > >> compelling
> > > > >>>>>>>>>> advantages
> > > > >>>>>>>>>>>>> for the vast majority of workloads that don't interact
> > > > >> with
> > > > >>>>> Velox.
> > > > >>>>>>>>>>>>> Whilst I can definitely see that the ListView
> > > > >> representation
> > > > >>>> is
> > > > >>>>>>>>>> probably
> > > > >>>>>>>>>>>>> a better way to represent variable length lists than
> what
> > > > >>>> arrow
> > > > >>>>>>>>>> settled
> > > > >>>>>>>>>>>>> upon, I'm not yet convinced it is sufficiently better
> to
> > > > >>>>>> incentivise
> > > > >>>>>>>>>>>>> broad ecosystem adoption.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Kind Regards,
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Raphael Taylor-Davies
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> On 11/05/2023 21:20, Will Jones wrote:
> > > > >>>>>>>>>>>>>> Hi Felipe,
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Thanks for the additional details.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Velox kernels benefit from being able to append data
> to
> > > > >> the
> > > > >>>>>> array
> > > > >>>>>>>>>> from
> > > > >>>>>>>>>>>>>>> different threads without care for strict ordering.
> > > > >> Only the
> > > > >>>>>>>>>> offsets
> > > > >>>>>>>>>>>>> array
> > > > >>>>>>>>>>>>>>> has to be written according to logical order but that
> > is
> > > > >>>>>>>>>> potentially a
> > > > >>>>>>>>>>>>> much
> > > > >>>>>>>>>>>>>>> smaller buffer than the values buffer.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> It still seems to me like applications are still
> pretty
> > > > >>>> niche,
> > > > >>>>>> as I
> > > > >>>>>>>>>>>>> suspect
> > > > >>>>>>>>>>>>>> in most cases the benefits are outweighed by the
> costs.
> > > > >> The
> > > > >>>>>> benefit
> > > > >>>>>>>>>>> here
> > > > >>>>>>>>>>>>>> seems pretty limited: if you are trying to split work
> > > > >> between
> > > > >>>>>>>>>> threads,
> > > > >>>>>>>>>>>>>> usually you will have other levels such as array
> chunks
> > > > >> to
> > > > >>>>>>>>>> parallelize.
> > > > >>>>>>>>>>>>> And
> > > > >>>>>>>>>>>>>> if you have an incoming stream of row data, you'll
> want
> > > > >> to
> > > > >>>>> append
> > > > >>>>>>> in
> > > > >>>>>>>>>>>>>> predictable order to match the order of the other
> > > > >> arrays. Am
> > > > >>>> I
> > > > >>>>>>>>>> missing
> > > > >>>>>>>>>>>>>> something?
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> And, IIUC, the cost of using ListView with
> out-of-order
> > > > >>>> values
> > > > >>>>>> over
> > > > >>>>>>>>>>>>>> ListArray is you lose memory locality; the values of
> > > > >> element
> > > > >>>> 2
> > > > >>>>>> are
> > > > >>>>>>>>>> no
> > > > >>>>>>>>>>>>>> longer adjacent to the values of element 1. What do
> you
> > > > >> think
> > > > >>>>>> about
> > > > >>>>>>>>>>> that
> > > > >>>>>>>>>>>>>> tradeoff?
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> I don't mean to be difficult about this. I'm excited
> for
> > > > >> both
> > > > >>>>> the
> > > > >>>>>>>>>> REE
> > > > >>>>>>>>>>> and
> > > > >>>>>>>>>>>>>> StringView arrays, but this one I'm not so sure about
> > > > >> yet. I
> > > > >>>>>>> suppose
> > > > >>>>>>>>>>>>> what I
> > > > >>>>>>>>>>>>>> am trying to ask is, if we added this, do we think
> many
> > > > >> Arrow
> > > > >>>>> and
> > > > >>>>>>>>>> query
> > > > >>>>>>>>>>>>>> engine implementations (for example, DataFusion) will
> be
> > > > >>>> eager
> > > > >>>>> to
> > > > >>>>>>>>>> add
> > > > >>>>>>>>>>>>> full
> > > > >>>>>>>>>>>>>> support for the type, including compute kernels? Or
> are
> > > > >> they
> > > > >>>>>> likely
> > > > >>>>>>>>>> to
> > > > >>>>>>>>>>>>> just
> > > > >>>>>>>>>>>>>> convert this type to ListArray at import boundaries?
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Because if it turns out to be the latter, then we
> might
> > > > >> as
> > > > >>>> well
> > > > >>>>>> ask
> > > > >>>>>>>>>>> Velox
> > > > >>>>>>>>>>>>>> to export this type as ListArray and save the rest of
> > the
> > > > >>>>>> ecosystem
> > > > >>>>>>>>>>> some
> > > > >>>>>>>>>>>>>> work.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Best,
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Will Jones
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira
> > > > >> Carvalho <
> > > > >>>>>>>>>>>>>> [email protected]<mailto:[email protected]>>
> wrote:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Initial reason for ListView arrays in Arrow is
> > zero-copy
> > > > >>>>>>>>>> compatibility
> > > > >>>>>>>>>>>>> with
> > > > >>>>>>>>>>>>>>> Velox which uses this format.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Velox kernels benefit from being able to append data
> to
> > > > >> the
> > > > >>>>>> array
> > > > >>>>>>>>>> from
> > > > >>>>>>>>>>>>>>> different threads without care for strict ordering.
> > > > >> Only the
> > > > >>>>>>>>>> offsets
> > > > >>>>>>>>>>>>> array
> > > > >>>>>>>>>>>>>>> has to be written according to logical order but that
> > is
> > > > >>>>>>>>>> potentially a
> > > > >>>>>>>>>>>>> much
> > > > >>>>>>>>>>>>>>> smaller buffer than the values buffer.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Acero kernels could take advantage of that in the
> > > > >> future.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> In implementing ListViewArray/Type I was able to
> reuse
> > > > >> some
> > > > >>>>> C++
> > > > >>>>>>>>>>>>> templates
> > > > >>>>>>>>>>>>>>> used for ListArray which can reduce some of the
> burden
> > > > >> on
> > > > >>>>> kernel
> > > > >>>>>>>>>>>>>>> implementations that aim to work with all the types.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> I’m can fix Acero kernels for working with ListView.
> > > > >> This is
> > > > >>>>>>>>>> similar
> > > > >>>>>>>>>>> to
> > > > >>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>> work I’ve doing in kernels dealing with run-end
> encoded
> > > > >>>>> arrays.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> —
> > > > >>>>>>>>>>>>>>> Felipe
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> On Wed, 26 Apr 2023 at 01:03 Will Jones <
> > > > >>>>>> [email protected]
> > > > >>>>>>>>>>>>> <mailto:[email protected]>> wrote:
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> I suppose one common use case is materializing list
> > > > >> columns
> > > > >>>>>> after
> > > > >>>>>>>>>>> some
> > > > >>>>>>>>>>>>>>>> expanding operation like a join or unnest. That's a
> > > > >> case
> > > > >>>>> where
> > > > >>>>>> I
> > > > >>>>>>>>>>> could
> > > > >>>>>>>>>>>>>>>> imagine a lot of repetition of values. Haven't yet
> > > > >> thought
> > > > >>>> of
> > > > >>>>>>>>>> common
> > > > >>>>>>>>>>>>>>> cases
> > > > >>>>>>>>>>>>>>>> where there is overlap but not full duplication, but
> > am
> > > > >>>> eager
> > > > >>>>>> to
> > > > >>>>>>>>>> hear
> > > > >>>>>>>>>>>>>>> any.
> > > > >>>>>>>>>>>>>>>> The dictionary encoding point Raphael makes is
> > > > >> interesting,
> > > > >>>>>>>>>>> especially
> > > > >>>>>>>>>>>>>>>> given the existence of LargeList and FixedSizeList.
> > For
> > > > >>>> many
> > > > >>>>>>>>>>>>> operations,
> > > > >>>>>>>>>>>>>>> it
> > > > >>>>>>>>>>>>>>>> might make more sense to just compose those existing
> > > > >> types.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> IIUC the operations that would be unique to the
> > > > >> ArrayView
> > > > >>>> are
> > > > >>>>>>> ones
> > > > >>>>>>>>>>>>>>> altering
> > > > >>>>>>>>>>>>>>>> the shape. One could truncate each array to a
> certain
> > > > >>>> length
> > > > >>>>>>>>>> cheaply
> > > > >>>>>>>>>>>>>>> simply
> > > > >>>>>>>>>>>>>>>> by replacing the sizes buffer. Or perhaps there are
> > > > >>>>> interesting
> > > > >>>>>>>>>>>>>>> operations
> > > > >>>>>>>>>>>>>>>> on tensors that would benefit.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 7:47 PM Raphael
> Taylor-Davies
> > > > >>>>>>>>>>>>>>>> <[email protected]> wrote:
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Unless I am missing something, I think the
> selection
> > > > >>>>> use-case
> > > > >>>>>>>>>> could
> > > > >>>>>>>>>>> be
> > > > >>>>>>>>>>>>>>>>> equally well served by a dictionary-encoded
> > > > >>>>>>> BinarArray/ListArray,
> > > > >>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>>> would
> > > > >>>>>>>>>>>>>>>>> have the benefit of not requiring any modifications
> > > > >> to the
> > > > >>>>>>>>>> existing
> > > > >>>>>>>>>>>>>>>> format
> > > > >>>>>>>>>>>>>>>>> or kernels.
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> The major additional flexibility of the proposed
> > > > >> encoding
> > > > >>>>>> would
> > > > >>>>>>>>>> be
> > > > >>>>>>>>>>>>>>>>> permitting disjoint or overlapping ranges, are
> these
> > > > >>>> common
> > > > >>>>>>>>>> enough
> > > > >>>>>>>>>>> in
> > > > >>>>>>>>>>>>>>>>> practice to represent a meaningful bottleneck?
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> On 26 April 2023 01:40:14 BST, David Li <
> > > > >>>>> [email protected]
> > > > >>>>>>>>>>> <mailto:
> > > > >>>>>>>>>>>>> [email protected]>> wrote:
> > > > >>>>>>>>>>>>>>>>>> Is there a need for a 64-bit offsets version the
> > > > >> same way
> > > > >>>>> we
> > > > >>>>>>>>>> have
> > > > >>>>>>>>>>>>> List
> > > > >>>>>>>>>>>>>>>>> and LargeList?
> > > > >>>>>>>>>>>>>>>>>> And just to be clear, the difference with List is
> > > > >> that
> > > > >>>> the
> > > > >>>>>>> lists
> > > > >>>>>>>>>>>>> don't
> > > > >>>>>>>>>>>>>>>>> have to be stored in their logical order (or in
> other
> > > > >>>> words,
> > > > >>>>>>>>>> offsets
> > > > >>>>>>>>>>>>> do
> > > > >>>>>>>>>>>>>>>> not
> > > > >>>>>>>>>>>>>>>>> have to be nondecreasing and so we also need
> sizes)?
> > > > >>>>>>>>>>>>>>>>>> On Wed, Apr 26, 2023, at 09:37, Weston Pace wrote:
> > > > >>>>>>>>>>>>>>>>>>> For context, there was some discussion on this
> back
> > > > >> in
> > > > >>>>> [1].
> > > > >>>>>> At
> > > > >>>>>>>>>>> that
> > > > >>>>>>>>>>>>>>>>> time
> > > > >>>>>>>>>>>>>>>>>>> this was called "sequence view" but I do not like
> > > > >> that
> > > > >>>>> name.
> > > > >>>>>>>>>>>>>>> However,
> > > > >>>>>>>>>>>>>>>>>>> array-view array is a little confusing. Given
> this
> > > > >> is
> > > > >>>>>> similar
> > > > >>>>>>>>>> to
> > > > >>>>>>>>>>>>>>> list
> > > > >>>>>>>>>>>>>>>>> can
> > > > >>>>>>>>>>>>>>>>>>> we go with list-view array?
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> Thanks for the introduction. I'd be interested
> to
> > > > >> hear
> > > > >>>>>> about
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>>>> applications Velox has found for these vectors,
> > > > >> and in
> > > > >>>>> what
> > > > >>>>>>>>>>>>>>>> situations
> > > > >>>>>>>>>>>>>>>>>>> they
> > > > >>>>>>>>>>>>>>>>>>>> are useful. This could be contrasted with the
> > > > >> current
> > > > >>>>>>>>>> ListArray
> > > > >>>>>>>>>>>>>>>>>>>> implementations.
> > > > >>>>>>>>>>>>>>>>>>> I believe one significant benefit is that take
> (and
> > > > >> by
> > > > >>>>>> proxy,
> > > > >>>>>>>>>>>>>>> filter)
> > > > >>>>>>>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>>>>>> sort are O(# of items) with the proposed format
> and
> > > > >> O(#
> > > > >>>> of
> > > > >>>>>>>>>> bytes)
> > > > >>>>>>>>>>>>>>> with
> > > > >>>>>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>>> current format. Jorge did some profiling to this
> > > > >> effect
> > > > >>>> in
> > > > >>>>>>> [1].
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> [1]
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>
> > https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq<
> > > > >>>>>>>>>>>>>
> > > > >>>>>
> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > >
> > > > >>>>>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 3:13 PM Will Jones <
> > > > >>>>>>>>>>> [email protected]
> > > > >>>>>>>>>>>>> <mailto:[email protected]>
> > > > >>>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>>> Hi Felipe,
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> Thanks for the introduction. I'd be interested
> to
> > > > >> hear
> > > > >>>>>> about
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>>>> applications Velox has found for these vectors,
> > > > >> and in
> > > > >>>>> what
> > > > >>>>>>>>>>>>>>>> situations
> > > > >>>>>>>>>>>>>>>>> they
> > > > >>>>>>>>>>>>>>>>>>>> are useful. This could be contrasted with the
> > > > >> current
> > > > >>>>>>>>>> ListArray
> > > > >>>>>>>>>>>>>>>>>>>> implementations.
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> IIUC it would be fairly cheap to transform a
> > > > >> ListArray
> > > > >>>> to
> > > > >>>>>> an
> > > > >>>>>>>>>>>>>>>>> ArrayView, but
> > > > >>>>>>>>>>>>>>>>>>>> expensive to go the other way.
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> Best,
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> Will Jones
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira
> > > > >>>> Carvalho
> > > > >>>>> <
> > > > >>>>>>>>>>>>>>>>>>>> [email protected]<mailto:[email protected]
> >>
> > > > >>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> Hi folks,
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> I would like to start a public discussion on
> the
> > > > >>>>> inclusion
> > > > >>>>>>>>>> of a
> > > > >>>>>>>>>>>>>>> new
> > > > >>>>>>>>>>>>>>>>> array
> > > > >>>>>>>>>>>>>>>>>>>>> format to Arrow — array-view array. The name is
> > > > >> also
> > > > >>>> up
> > > > >>>>>> for
> > > > >>>>>>>>>>>>>>> debate.
> > > > >>>>>>>>>>>>>>>>>>>>> This format is inspired by Velox's ArrayVector
> > > > >> format
> > > > >>>>> [1].
> > > > >>>>>>>>>>>>>>>> Logically,
> > > > >>>>>>>>>>>>>>>>>>>> this
> > > > >>>>>>>>>>>>>>>>>>>>> array represents an array of arrays. Each
> element
> > > > >> is
> > > > >>>> an
> > > > >>>>>>>>>>>>>>> array-view
> > > > >>>>>>>>>>>>>>>>>>>> (offset
> > > > >>>>>>>>>>>>>>>>>>>>> and size pair) that points to a range within a
> > > > >> nested
> > > > >>>>>>>>>> "values"
> > > > >>>>>>>>>>>>>>>> array
> > > > >>>>>>>>>>>>>>>>>>>>> (called "elements" in Velox docs). The nested
> > > > >> array
> > > > >>>> can
> > > > >>>>> be
> > > > >>>>>>> of
> > > > >>>>>>>>>>> any
> > > > >>>>>>>>>>>>>>>>> type,
> > > > >>>>>>>>>>>>>>>>>>>>> which makes this format very flexible and
> > > > >> powerful.
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> [image: ../_images/array-vector.png]
> > > > >>>>>>>>>>>>>>>>>>>>> <
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>
> > https://facebookincubator.github.io/velox/_images/array-vector.png
> > > > >>>>> <
> > > > >>>>>>>>>>>>>
> > > > >>>>>>
> > > https://facebookincubator.github.io/velox/_images/array-vector.png
> > > > >>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> I'm currently working on a C++ implementation
> and
> > > > >> plan
> > > > >>>>> to
> > > > >>>>>>>>>> work
> > > > >>>>>>>>>>>>>>> on a
> > > > >>>>>>>>>>>>>>>>> Go
> > > > >>>>>>>>>>>>>>>>>>>>> implementation to fulfill the
> two-implementations
> > > > >>>>>>> requirement
> > > > >>>>>>>>>>> for
> > > > >>>>>>>>>>>>>>>>> format
> > > > >>>>>>>>>>>>>>>>>>>>> changes.
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> The draft design:
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> - 3 buffers: [validity_bitmap, int32 offsets
> > > > >> buffer,
> > > > >>>>> int32
> > > > >>>>>>>>>> sizes
> > > > >>>>>>>>>>>>>>>>> buffer]
> > > > >>>>>>>>>>>>>>>>>>>>> - 1 child array: "values" as an array of the
> type
> > > > >>>>>> parameter
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> validity_bitmap is used to differentiate
> between
> > > > >> empty
> > > > >>>>>> array
> > > > >>>>>>>>>>>>>>> views
> > > > >>>>>>>>>>>>>>>>>>>>> (sizes[i] == 0) and NULL array views
> > > > >>>> (validity_bitmap[i]
> > > > >>>>>> ==
> > > > >>>>>>>>>> 0).
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> When the validity_bitmap[i] is 0, both sizes
> and
> > > > >>>> offsets
> > > > >>>>>> are
> > > > >>>>>>>>>>>>>>>>> undefined
> > > > >>>>>>>>>>>>>>>>>>>> (as
> > > > >>>>>>>>>>>>>>>>>>>>> usual), and when sizes[i] == 0, offsets[i] is
> > > > >>>>> undefined. 0
> > > > >>>>>>> is
> > > > >>>>>>>>>>>>>>>>> recommended
> > > > >>>>>>>>>>>>>>>>>>>>> if setting a value is not an issue to the
> system
> > > > >>>>> producing
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>> arrays.
> > > > >>>>>>>>>>>>>>>>>>>>> offsets buffer is not required to be ordered
> and
> > > > >> views
> > > > >>>>>> don't
> > > > >>>>>>>>>>> have
> > > > >>>>>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>>> be
> > > > >>>>>>>>>>>>>>>>>>>>> disjoint.
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> [1]
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>
> > > >
> > >
> >
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> > > > >>>>>>>>>>>>> <
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>
> > > >
> > >
> >
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>>>>>>>>>> Felipe O. Carvalho
> > > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>
> > > > >
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Reply via email to