On Wed, Jun 14, 2023 at 5:07 PM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

> Even something relatively straightforward becomes a huge implementation
> effort when multiplied by a large number of codebases, users and
> datasets. Parquet is a great source of historical examples of the
> challenges of incremental changes that don't meaningfully unlock new
> use-cases. To take just one, Int96 was deprecated almost a decade ago,
> in favour of some additional metadata over an existing physical layout,
> and yet Int96 is still to the best of my knowledge used by Spark by
> default.
>
> That's not to say that I think the arrow specification should ossify and
> we should never change it, but I'm not hugely enthusiastic about adding
> encodings that are only incremental improvements over existing encodings.
>
> I therefore wonder if there are some new use-cases I am missing that
> would be unlocked by this change, and that wouldn't be supported by the
> dictionary proposal? Perhaps you could elaborate here?


The dict-encoded ListArray would add a memory indirection: the slot in the
dictionary has to be resolved before the offset can be. With ListView, the
CPU can access offset and size without waiting on that memory access. Both
need two buffer accesses after the validity check, but ListView doesn't
have that dependency.


> Whilst I do agree
> using dictionaries as proposed is perhaps a less elegant solution, I
> don't see anything inherently wrong with it, and if it ain't broke we
> really shouldn't be trying to fix it.
>
> Kind Regards,
>
> Raphael Taylor-Davies
>
> On 14 June 2023 17:52:52 BST, Felipe Oliveira Carvalho
> <felipe...@gmail.com> wrote:
>
>     General approach to alternative formats aside, in the specific case
>     of ListView, I think the implementation complexity is being
>     overestimated in these discussions. The C++ Arrow implementation
>     shares a lot of code between List and LargeList. And with some
>     tweaks, I'm able to share that common infrastructure for ListView as
>     well. [1] ListView is similar to list: it doesn't require offsets to
>     be sorted and adds an extra buffer containing sizes. For symmetry
>     with the List and LargeList types (FixedSizeList not included), I'm
>     going to propose we add a LargeListView. That is not part of the
>     draft implementation yet, but seems like an obvious thing to have
>     now that I implemented the `if_else` specialization. [2] David Li
>     asked about this above and I can confirm now that 64-bit version of
>     ListView (LargeListView) is in the plans. Trying to avoid
>     re-implementing some kernels is not a good goal to chase, IMO,
>     because kernels need tweaks to take advantage of the format. [1]
>     https://github.com/apache/arrow/pull/35345 [2]
>     https://github.com/felipecrv/arrow/commits/list_view_backup --
>     Felipe On Wed, Jun 14, 2023 at 12:08 PM Weston Pace
>     <weston.p...@gmail.com> wrote:
>
>             perhaps we could support this use-case as a canonical
>             extension type over dictionary encoded, variable-sized arrays
>
>         I believe this suggestion is valid and could be used to solve
>         the if-else case. The algorithm, if I understand it, would be
>         roughly: ``` // Note: Simple pseudocode, vectorization left as
>         exercise for the reader auto indices_builder = ... auto
>         list_builder = ... indices_builder.resize(batch.length); Array
>         condition_mask = condition.EvaluateBatch(batch); for row_index
>         in selected_rows(condition_mask): indices_builder[row_index] =
>         list_builder.CurrentLength();
>         list_builder.Append(if_expr.EvaluateRow(batch, row_index)) for
>         row_index in unselected_rows(condition_mask):
>         indices_builder[row_index] = list_builder.CurrentLength();
>         list_builder.Append(else_expr.EvaluateRow(batch, row_index))
>         return DictionaryArray(indices_builder.Finish(),
>         list_builder.Finish()) ``` I also agree this is a slightly
>         awkward use of dictionaries (e.g. the dictionary would have the
>         same size as the # of indices) and perhaps not the most
>         intuitive way to solve the problem. My gut reaction is simply
>         "an improved if/else kernel is not, alone, enough justification
>         for a new layout" and yet... I think we are seeing the start (I
>         hope) of a trend where Arrow is not just used "between systems"
>         (e.g. to shuttle data from one place to another, or between a
>         query engine and a visualization tool) but also "within systems"
>         (e.g. UDFs, bespoke file formats and temporary tables, between
>         workers in a distributed query engine). When arrow is used
>         "within systems" I think both the number of bespoke formats and
>         the significance of conversion cost increases. For example, it's
>         easy to say that Velox should convert at the boundary as data
>         leaves Velox. But what if Velox (or datafusion or ...) were to
>         define an interface for UDFs. Would we want to use Arrow there
>         (e.g. the C data interface is a good fit)? If so, wouldn't the
>         conversion cost be more significant?
>
>             Also, I'm very lukewarm towards the concept of "alternative
>             layouts" suggested somewhere else in this thread. It does
>             not seem a good choice to complexify the Arrow format that
>             much.
>
>         I think, in my opinion, this depends on how many of these
>         alternative layouts exist. If there are just a few, then I
>         agree, we should just adopt them as formal first-class layouts.
>         If there are many, then I think it will be too much complexity
>         in Arrow to have all the different choices. Or, we could say
>         there are many, but the alternatives don't belong in Arrow at
>         all. In that case I think it's the same question as the above
>         paragraph, "do we want Arrow to be used within systems? Or just
>         between systems?" On Wed, Jun 14, 2023 at 2:07 AM Antoine Pitrou
>         <anto...@python.org> wrote:
>
>             I agree that ListView cannot be an extension type, given
>             that it features a new layout, and therefore cannot
>             reasonably be backed by an existing storage type (AFAICT).
>             Also, I'm very lukewarm towards the concept of "alternative
>             layouts" suggested somewhere else in this thread. It does
>             not seem a good choice to complexify the Arrow format that
>             much. Regards Antoine. Le 07/06/2023 à 00:21, Felipe
>             Oliveira Carvalho a écrit :
>
>                 +1 on what Ian said. And as I write kernels for this new
>                 format, I’m learning that it’s
>
>             possible
>
>                 to re-use the common infrastructure used by List and
>                 LargeList to
>
>             implement
>
>                 the ListView related features with some adjustments. IMO
>                 having this format as a second-class citizen would more
>                 likely complicate things because it would make this
>                 unification harder. — Felipe On Tue, 6 Jun 2023 at 18:45
>                 Ian Cook <ianmc...@apache.org> wrote:
>
>                     To clarify why we cannot simply propose adding
>                     ListView as a new “canonical extension type”: The
>                     extension type mechanism in Arrow depends on the
>                     underlying data being organized in an existing Arrow
>                     layout—that way an implementation that does not
>                     support the extension type can still handle the
>                     underlying data. But ListView is a wholly new
>                     layout. I strongly agree with Weston’s idea that it
>                     is a good time for Arrow to introduce the notion of
>                     “canonical alternative layouts.” Taken together, I
>                     think that canonical extension types and canonical
>                     alternative layouts could serve as an “incubator”
>                     for proposed new representations. For example, if a
>                     proposed canonical alternative layout ends up being
>                     broadly adopted, then that will serve as a signal
>                     that we should consider adding it as a primary
>                     layout in the core spec. It seems to me that most
>                     projects that are implementing Arrow today are not
>                     aiming to provide complete coverage of Arrow; rather
>                     they are adopting Arrow because of its role as a
>                     standard and they are implementing only as much of
>                     the Arrow standard as they require to achieve some
>                     goal. I believe that such projects are important
>                     Arrow stakeholders, and I believe that this proposed
>                     notion of canonical alternative layouts will serve
>                     them well and will create efficiencies by
>                     standardizing implementations around a shared set of
>                     alternatives. However I think that the documentation
>                     for canonical alternative layouts should strongly
>                     encourage implementers to default to using the
>                     primary layouts defined in the core spec and only
>                     use alternative layouts in cases where the primary
>                     layouts do not meet their needs. On Sat, May 27,
>                     2023 at 7:44 PM Micah Kornfield <
>
>         emkornfi...@gmail.com>
>
>                     wrote:
>
>                         This sounds reasonable to me but my main concern
>                         is, I'm not sure
>
>         there
>
>                     is
>
>                         a great mechanism to enforce canonical layouts
>                         don't somehow become
>
>                     default
>
>                         (or the only implementation). Even for these new
>                         layouts, I think it might be worth rethinking
>
>             binding
>
>                     a
>
>                         layout into the schema versus having a different
>                         concept of encoding
>
>             (and
>
>                         changing some of the corresponding data
>                         structures). On Mon, May 22, 2023 at 10:37 AM
>                         Weston Pace <weston.p...@gmail.com>
>
>                     wrote:
>
>                             Trying to settle on one option is a
>                             fruitless endeavor. Each type
>
>         has
>
>                     pros
>
>                             and cons. I would also predict that the
>                             largest existing usage of
>
>                     Arrow is
>
>                             shuttling data from one system to another.
>                             The newly proposed
>
>         format
>
>                             doesn't appear to have any significant
>                             advantage for that use case
>
>         (if
>
>                             anything, the existing format is arguably
>                             better as it is more
>
>                     compact).
>
>                             I am very biased towards historical
>                             precedent and avoiding breaking changes. We
>                             have "canonical extension types", perhaps it
>                             is time for
>
>         "canonical
>
>                             alternative layouts". We could define it as
>                             such: * There are one or more primary
>                             layouts * Existing layouts are automatically
>                             considered primary layouts,
>
>                     even if
>
>                             they wouldn't have been primary layouts
>                             initially (e.g. large list) * A new layout,
>                             if it is semantically equivalent to another,
> is
>
>                     considered
>
>                             an alternative layout * An alternative
>                             layout still has the same requirements for
>
>         adoption
>
>                     (two
>
>                             implementations and a vote) * An
>                             implementation should not feel pressured to
>                             rush and
>
>             implement
>
>                     the
>
>                             new layout. It would be good if they
>                             contribute in the discussion and
>
>                     consider the
>
>                             layout and vote if they feel it would be an
>                             acceptable design. * We can define and vote
>                             and approve as many canonical alternative
>
>                     layouts
>
>                             as we want: * A canonical alternative layout
>                             should, at a minimum, have some reasonable
>                             justification, such as improved performance
> for
>
>                     algorithm X
>
>                             * Arrow implementations MUST support the
>                             primary layouts * An Arrow implementation
>                             MAY support a canonical alternative,
>
>                     however:
>
>                             * An Arrow implementation MUST first support
>                             the primary layout * An Arrow implementation
>                             MUST support conversion to/from the
>
>                     primary
>
>                             and canonical layout * An Arrow
>                             implementation's APIs MUST only provide data
>                             in the alternative layout if it is
>                             explicitly asked for (e.g. schema inference
>
>             should
>
>                             prefer the primary layout). * We can still
>                             vote for new primary layouts (e.g. promoting a
>
>                     canonical
>
>                             alternative) but, in these votes we don't
>                             only consider the value (e.g. performance) of
>
>         the
>
>                     layout
>
>                             but also the interoperability. In other
>                             words, a layout can only become a primary
>                             layout if
>
>             there
>
>                     is
>
>                             significant evidence that most
>                             implementations plan to adopt it. This lets
>                             us evolve support for new layouts more
>                             naturally. We can generally assume that
>                             users will not, initially, be aware of these
>                             alternative layouts. However, everything
>                             will just work. They may
>
>                     start
>
>                             to see a performance penalty stemming from a
>                             lack of support for
>
>         these
>
>                             layouts. If this performance penalty becomes
>                             significant then they
>
>                     will
>
>                             discover it and become aware of the problem.
>                             They can then ask
>
>                     whatever
>
>                             library they are using to add support for
>                             the alternative layout.
>
>         As
>
>                             enough users find a need for it then
>                             libraries will add support. Eventually,
>                             enough libraries will support it that we can
>                             adopt it
>
>         as a
>
>                             primary layout. Also, it allows libraries to
>                             adopt alternative layouts more
>
>                     aggressively if
>
>                             they would like while still hopefully
>                             ensuring that we eventually
>
>         all
>
>                             converge on the same implementation of the
>                             alternative layout. On Mon, May 22, 2023 at
>                             9:35 AM Will Jones <will.jones...@gmail.com
>
>                             wrote:
>
>                                 Hello Arrow devs, I don't understand why
>                                 we would start deprecating features in the
>
>                     Arrow
>
>                                     format. Even starting this talk
>                                     might already be a bad idea
>
>                     PR-wise.
>
>                                 I agree we don't want to make breaking
>                                 changes to the Arrow format.
>
>                     But
>
>                                 several maintainers have already stated
>                                 they have no interest in maintaining
>                                 both list types with full compute
>                                 functionality [1][2],
>
>                     so I
>
>                                 think it's very likely one list type or
>                                 the other will be implicitly preferred
>                                 in the ecosystem if this data type was
>                                 added.
>
>                     If
>
>                                 that's the case, I'd prefer that we
>                                 agreed as a community which one
>
>                             should
>
>                                 be preferred. Maybe that's not the best
>                                 path; it's just one way for
>
>                     us to
>
>                                 balance stability, maintenance burden,
>                                 and relevance. Can someone help distill
>                                 down the primary rationale and usecase for
>
>                                     adding ArrayView to the Arrow Spec?
>
>                                 Looking back at that old thread, I think
>                                 one of the main
>
>         motivations
>
>                     is
>
>                             to
>
>                                 try to prevent query engine implementers
>                                 from feeling there is a
>
>                     tradeoff
>
>                                 between having state-of-the-art
>                                 performance and being Arrow-native.
>
>                     For
>
>                                 some of the new array types, we had both
>                                 Velox and DuckDB use them,
>
>                     so it
>
>                                 was reasonable to expect they were
>                                 innovations that might
>
>                     proliferate.
>
>                             I'm
>
>                                 not sure if the ArrayView is part of
>                                 that. From Wes earlier [3]: The idea is
>                                 that in a world of data and query
>                                 federation (for
>
>                     example,
>
>                                     consider [1] where Arrow is being
>                                     used as a data federation layer
>
>                     with
>
>                                 many
>
>                                     query engines), we want to increase
>                                     the amount of data in-flight
>
>                     and
>
>                                     in-memory that is in Arrow format.
>                                     So if query engines are having
>
>                     to
>
>                                 depart
>
>                                     substantially from the Arrow format
>                                     to get performance, then this
>
>                                 creates a
>
>                                     potential lose-lose situation: *
>                                     Depart from Arrow: get better
>
>                                 performance
>
>                                     but pay serialization costs to read
>                                     and write Arrow (the
>
>                     performance
>
>                             and
>
>                                     resource utilization benefits
>                                     outweigh the serialization costs).
>
>                     This
>
>                                 puts
>
>                                     additional pressure on query engines
>                                     to build specialized
>
>                     components
>
>                             for
>
>                                     solving problems rather than making
>                                     use of off-the-shelf
>
>         components
>
>                             that
>
>                                     use Arrow. This has knock-on effects
>                                     on ecosystem fragmentation. *
>
>                     Or
>
>                             use
>
>                                     Arrow, and accept suboptimal query
>                                     processing performance
>
>                                 Will mentions one usecase is Velox
>                                 consuming python UDF output,
>
>         which
>
>                             seems
>
>                                     to be mostly about how fast Velox
>                                     can consume this format, not how
>
>                     fast
>
>                                 it
>
>                                     can be written. Are there other
>                                     usecases?
>
>                                 To be clear, I don't know if that's the
>                                 use case they want. That's
>
>                     just
>
>                             me
>
>                                 speculating. I still have some questions
>                                 myself: 1. Is this array type currently
>                                 only used in Velox? (not DuckDB
>
>         like
>
>                     some
>
>                                 of the other new types?) What evidence
>                                 do we have that it will
>
>         become
>
>                             used
>
>                                 outside of Velox? 2. We already have
>                                 three list types: list, large list (64-bit
>
>                     offsets),
>
>                             and
>
>                                 fixed size list. Do we think we will
>                                 only want a view version of
>
>         the
>
>                             32-bit
>
>                                 offset variable length list? Or are we
>                                 potentially talking about
>
>         view
>
>                                 variants for all three? Best, Will Jones
>                                 [1]
>
>         https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx
>
>                                 [2]
>
>         https://lists.apache.org/thread/cc4w3vs3foj1fmpq9x888k51so60ftr3
>
>                                 [3]
>
>         https://lists.apache.org/thread/mk2yn62y6l8qtngcs1vg2qtwlxzbrt8t
>
>                                 On Mon, May 22, 2023 at 3:48 AM Andrew
>                                 Lamb <al...@influxdata.com>
>
>                             wrote:
>
>                                     Can someone help distill down the
>                                     primary rationale and usecase
>
>         for
>
>                                     adding ArrayView to the Arrow Spec?
>                                      From the above discussions, the
>                                     stated rationale seems to be fast
>                                     (zero-copy) interchange with Velox.
>                                     This thread has qualitatively
>                                     enumerated the benefits of
>
>                     (offset+len)
>
>                                     encoding over the existing Arrow
>                                     ListArray (offets) approach, but
>
>         I
>
>                                 haven't
>
>                                     seen any performance measurements
>                                     that might help us to gauge the
>
>                                 tradeoff
>
>                                     in additional complexity vs runtime
>                                     overhead. Will mentions one usecase
>                                     is Velox consuming python UDF output,
>
>                     which
>
>                                 seems
>
>                                     to be mostly about how fast Velox
>                                     can consume this format, not how
>
>                     fast
>
>                                 it
>
>                                     can be written. Are there other
>                                     usecases? Do we have numbers showing
>                                     how much overhead converting to /from
>
>                             Velox's
>
>                                     internal representation and the
>                                     existing ListArray adds? Has
>
>                     anyone in
>
>                                     Velox land considered adding faster
>                                     support for Arrow style
>
>                     ListArray
>
>                                     encoding? Andrew On Mon, May 22,
>                                     2023 at 4:38 AM Antoine Pitrou <
>
>         anto...@python.org
>
>                                 wrote:
>
>                                         Hi, I don't understand why we
>                                         would start deprecating features
>                                         in the
>
>                             Arrow
>
>                                         format. Even starting this talk
>                                         might already be a bad idea
>
>                     PR-wise.
>
>                                         As for implementing conversions
>                                         at the I/O boundary, it's a
>
>                             reasonably
>
>                                         policy, but it still requires
>                                         work by implementors and it's not
>
>                             granted
>
>                                         that all consumers of the Arrow
>                                         format will grow such
>                                         conversions if/when we add
>                                         non-trivial types such as
>                                         ListView or StringView. Regards
>                                         Antoine. Le 22/05/2023 à 00:39,
>                                         Will Jones a écrit :
>
>                                             One more thing: Looking back
>                                             on the previous discussion[1]
>
>                     (which
>
>                                     Weston
>
>                                             pointed out in his earlier
>                                             message), Jorge suggested
>                                             that the
>
>                     old
>
>                                 list
>
>                                             types might be deprecated in
>                                             favor of view variants [2].
>                                             Others
>
>                             were
>
>                                             worried that it might
>                                             undermine the perception
>                                             that the Arrow
>
>                             format
>
>                                 is
>
>                                             stable. I think it might be
>                                             worth thinking about "soft
>
>                     deprecating"
>
>                                 the
>
>                                         old
>
>                                             list type: suggesting new
>                                             implementations prefer the
> list
>
>                     view, but
>
>                                             reassuring that
>                                             implementations should
>                                             support the old format,
>
>                     even
>
>                                 if
>
>                                         they
>
>                                             just convert to the new
>                                             format. To be clear, this
>                                             wouldn't
>
>                     mean we
>
>                                 plan
>
>                                         to
>
>                                             create breaking changes in
>                                             the format; but if we ever
>                                             did for
>
>                     other
>
>                                             reasons, the old list type
>                                             might go. Arrow compute
>                                             libraries could choose
>                                             either format for compute
>
>                                 support,
>
>                                         and
>
>                                             plan to do conversion at the
>                                             boundaries. Libraries that use
>
>                     the new
>
>                                     type
>
>                                             will have cheap conversion
>                                             when reading the old type.
>                                             Meanwhile
>
>                             those
>
>                                         that
>
>                                             are still on the old type
>                                             will have some incentive to
>                                             move
>
>                     towards
>
>                                 the
>
>                                         new
>
>                                             one, since that conversion
>                                             will not be as efficient. [1]
>
>
> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
>
>
>                                             [2]
>
>
> https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx
>
>
>                                             On Sun, May 21, 2023 at
>                                             3:07 PM Will Jones <
>
>                             will.jones...@gmail.com>
>
>                                         wrote:
>
>                                                 Hello, I think Sasha
>                                                 brings up a good point,
>                                                 that the advantages of
>
>                     this
>
>                                     format
>
>                                                 seem to be primarily
>                                                 about query processing.
>                                                 Other encodings
>
>                     like
>
>                                 REE
>
>                                         and
>
>                                                 dictionary have
>                                                 space-saving advantages
>                                                 that justify them
>
>                     simply
>
>                             in
>
>                                         terms
>
>                                                 of space efficiency
>                                                 (although they have
>                                                 query processing
>
>                             advantages
>
>                                 as
>
>                                                 well). As discussed,
>                                                 most use cases are
>                                                 already well served by
>
>                                     existing
>
>                                                 list types and
>                                                 dictionary encoding. I
>                                                 agree that there are
>                                                 cases where transferring
>                                                 this type
>
>                     without
>
>                                                 conversion would be
>                                                 ideal. One use case I
>                                                 can think of is if
>
>                     Velox
>
>                                         wants to
>
>                                                 be able to take
>                                                 Arrow-based UDFs
>                                                 (possibly written with
>
>                     PyArrow,
>
>                             for
>
>                                                 example) that operate on
>                                                 this column type and
>                                                 therefore wants
>
>                                     zero-copy
>
>                                                 exchange over the C Data
>                                                 Interface. One big
>                                                 question I have: we
>                                                 already have three list
>                                                 types:
>
>                     list,
>
>                                 large
>
>                                                 list (64-bit offsets),
>                                                 and fixed size list. Do
>                                                 we think we
>
>                     will
>
>                             only
>
>                                         want a
>
>                                                 view version of the
>                                                 32-bit offset variable
>                                                 length list? Or
>
>                     are we
>
>                                                 potentially talking
>                                                 about view variants for
>                                                 all three? Best, Will
>                                                 Jones On Sun, May 21,
>                                                 2023 at 2:19 PM Felipe
>                                                 Oliveira Carvalho <
>                                                 felipe...@gmail.com>
> wrote:
>
>                                                     The benefit of
>                                                     having a memory
>                                                     format that’s
>                                                     friendly to
>
>                                         non-deterministic
>
>                                                     order writes is
>                                                     unlocked by the
>                                                     transport and
>                                                     processing of
>
>                     the
>
>                                 data
>
>                                         being
>
>                                                     agnostic to the
>                                                     physical order as
>                                                     much as possible.
>                                                     Requiring a
>                                                     conversion could
>                                                     cancel out that
>                                                     benefit. But it
>
>                     can
>
>                                 be a
>
>                                                     provisory step for
>                                                     compatibility
>                                                     between systems that
>                                                     don’t
>
>                                     understand
>
>                                         the
>
>                                                     format yet. This is
>                                                     similar to the
>                                                     situation with
>                                                     compression
>
>                                 schemes
>
>                                         like
>
>                                                     run-end encoding —
>                                                     the goal is
>                                                     processing the
>                                                     compressed data
>
>                                     directly
>
>                                                     without an expansion
>                                                     step whenever
>                                                     possible. This is
>                                                     why having it as
>                                                     part of the open
>                                                     Arrow format is so
>
>                                     important:
>
>                                                     everyone can agree
>                                                     on a format that’s
>                                                     friendly to parallel
>
>                     and/or
>
>                                                     vectorized compute
>                                                     kernels without
>                                                     introducing multiple
>
>                                 incompatible
>
>                                                     formats to the
>                                                     ecosystem and
>                                                     without imposing a
>                                                     conversion
>
>                     step
>
>                                     between
>
>                                                     the different
>                                                     systems. — Felipe On
>                                                     Sat, 20 May 2023 at
>                                                     20:04 Aldrin
>
>                     <octalene....@pm.me.invalid>
>
>                                         wrote:
>
>                                                         I don't feel
>                                                         like this
>                                                         representation
>                                                         is necessarily a
>
>                     detail of
>
>                                 the
>
>                                                     query
>
>                                                         engine, but I am
>                                                         also not sure
>                                                         why this
>                                                         representation
>                                                         would
>
>                             have
>
>                                 to
>
>                                         be
>
>                                                         converted to a
>                                                         non-view format
>                                                         when
>                                                         serializing.
>                                                         Could you
>
>                             clarify
>
>                                                     that? My
>
>                                                         impression is
>                                                         that this
>                                                         representation
>                                                         could be used for
>
>                                 persistence
>
>                                         or
>
>                                                         data transfer,
>                                                         though it can be
>                                                         more complex to
>                                                         guarantee
>
>                     the
>
>                                     portion
>
>                                         of
>
>                                                         the buffer that
>                                                         an index points
>                                                         to is also
>                                                         present in
>
>                     memory.
>
>                                                         Sent from Proton
>                                                         Mail for iOS On
>                                                         Sat, May 20,
>                                                         2023 at 15:00,
>                                                         Sasha Krassovsky <
>
>
> krassovskysa...@gmail.com
>
>
>
> <On+Sat,+May+20,+2023+at+15:00,+Sasha+Krassovsky+%3C%3Ca+href=>>
>
>
>                                         wrote:
>
>                                                         Hi everyone, I
>                                                         understand that
>                                                         there are
>                                                         numerous
>                                                         benefits to this
>
>                                 representation
>
>                                                         during query
>                                                         processing, but
>                                                         would it be fair
>                                                         to say that
>
>                     this
>
>                             is
>
>                                 an
>
>                                                         implementation
>                                                         detail of the
>                                                         query engine?
>                                                         Query engines
>
>                     don’t
>
>                                                     necessarily
>
>                                                         need to conform
>                                                         to the Arrow
>                                                         format
>                                                         internally, only
> at
>
>                                     ingest/egress
>
>                                                         points, and
>                                                         performing a
>                                                         conversion from
>                                                         the non-view to
>
>                     view
>
>                                 format
>
>                                                     seems
>
>                                                         like it would be
>                                                         very cheap
>                                                         (though I
>                                                         understand not
>
>                     necessarily
>
>                                 the
>
>                                                     other
>
>                                                         way around, but
>                                                         you’d need to do
>                                                         that anyway if
>                                                         you’re
>
>                                 serializing).
>
>                                                         Sasha Krassovsky
>
>                                                             20 мая 2023
>                                                             г., в 13:00,
>                                                             Will Jones <
>
>                     will.jones...@gmail.com>
>
>                                                         написал(а):
>
>                                                             Thanks for
>                                                             sharing
>                                                             these
>                                                             details,
>                                                             Pedro. The
>                                                             conditional
>
>                                 branches
>
>                                                         argument
>
>                                                             makes a lot
>                                                             of sense to
>                                                             me. The
>                                                             tensors
>                                                             point brings
>                                                             up some
>                                                             interesting
>                                                             issues. For
>
>                     now,
>
>                                 we've
>
>                                                         defined
>
>                                                             our only
>                                                             tensor
>                                                             extension
>                                                             type to be
>                                                             built on a
>                                                             fixed size
>
>                             list.
>
>                                     If a
>
>                                                     use
>
>                                                             case of this
>                                                             might be
>                                                             manipulating
>                                                             tensors with
>                                                             zero copy,
>
>                                 perhaps
>
>                                                     that
>
>                                                             suggests
>                                                             that we want
>                                                             a fixed size
>                                                             list
>                                                             variant? In
>
>                     addition,
>
>                                 would
>
>                                         we
>
>                                                         have
>
>                                                             to define
>                                                             another
>                                                             extension
>                                                             type to be a
>                                                             ListView
>                                                             variant?
>
>                     Or
>
>                                 would
>
>                                         we
>
>                                                         want
>
>                                                             to think
>                                                             about making
>                                                             extension
>                                                             types
>                                                             somehow
>                                                             valid across
>
>                                 various
>
>                                                             encodings of
>                                                             the same
>                                                             "logical type"?
>
>                                                                 On Fri,
>                                                                 May 19,
>                                                                 2023 at
>                                                                 1:59 PM
>                                                                 Pedro
>                                                                 Eugenio
>                                                                 Rocha
>
>                     Pedreira
>
>
> <pedro...@meta.com.invalid>
>                                                                 wrote:
>                                                                 Hi all,
>                                                                 This is
>                                                                 Pedro
>                                                                 from the
>                                                                 Velox
>                                                                 team at
>                                                                 Meta.
>                                                                 This is my
>
>                     first
>
>                             time
>
>                                                     here,
>
>                                                         so
>
>                                                                 nice to
>                                                                 e-meet
>                                                                 you!
>                                                                 Adding
>                                                                 to what
>                                                                 Felipe
>                                                                 said,
>                                                                 the main
>                                                                 reason
>                                                                 we created
>
>                                 “ListView”
>
>                                                         (though
>
>                                                                 we just
>                                                                 call
>                                                                 them
>
> ArrayVector/MapVector
>                                                                 in
>                                                                 Velox)
>                                                                 is that,
>
>                             along
>
>                                         with
>
>                                                                 StringViews
>                                                                 for
>                                                                 strings,
>                                                                 they
>                                                                 allow us
>                                                                 to write
>                                                                 any
>
>                     columnar
>
>                                     buffer
>
>
> out-or-order,
>                                                                 regardless
>                                                                 of their
>                                                                 types or
>                                                                 encodings.
>
>                     This is
>
>                                                     naturally
>
>                                                                 doable
>                                                                 for all
>                                                                 primitive
>                                                                 types
>
> (fixed-size),
>                                                                 but not
> for
>
>                     types
>
>                                     that
>
>                                                         don’t
>
>                                                                 have
>                                                                 fixed
>                                                                 size and
>                                                                 are
>                                                                 required
>                                                                 to be
>                                                                 contiguous.
>                                                                 The
>
>                                 StringView
>
>                                         and
>
>                                                                 ListView
>                                                                 formats
>                                                                 allow us
>                                                                 to keep
>                                                                 this
>                                                                 invariant
>                                                                 in
>                                                                 Velox.
>                                                                 Being
>                                                                 able to
>                                                                 write
>                                                                 vectors
>
> out-of-order
>                                                                 is
>                                                                 useful
> when
>
>                                 executing
>
>
> conditionals
>                                                                 like
>                                                                 IF/SWITCH
>                                                                 statements,
>                                                                 which are
>
>                     pervasive
>
>                                 among
>
>                                         our
>
>                                                                 workloads.
>                                                                 To fully
>                                                                 vectorize
>                                                                 it, one
>                                                                 first
>                                                                 evaluates
>                                                                 the
>
>                                         expression,
>
>                                                         then
>
>                                                                 generate
>                                                                 a bitmap
>                                                                 containing
>                                                                 which
>                                                                 rows
>                                                                 take the
>                                                                 THEN and
>
>                             which
>
>                                     take
>
>                                                     the
>
>                                                                 ELSE
>                                                                 branch.
>                                                                 Then you
>                                                                 populate
>                                                                 all rows
>                                                                 that
>                                                                 match the
>
>                     first
>
>                                     branch
>
>                                                     by
>
>                                                                 evaluating
>                                                                 the THEN
>                                                                 expression
>                                                                 in a
>                                                                 vectorized
>
>                     (branch-less
>
>                             and
>
>                                         cache
>
>                                                                 friendly)
>                                                                 way, and
>
> subsequently
>                                                                 the ELSE
>                                                                 branch.
>                                                                 If you
>
>                     can’t
>
>                                     write
>
>                                                     them
>
>
> out-of-order,
>                                                                 you
>                                                                 would
>                                                                 either
>                                                                 have a
>                                                                 big
>                                                                 branch
>                                                                 per row
>
>                                     dispatching
>
>                                                     to
>
>                                                         the
>
>                                                                 right
>                                                                 expression
>                                                                 (slow),
>                                                                 or
>                                                                 populate
>                                                                 two
>                                                                 distinct
>                                                                 vectors
>
>                     then
>
>                                                     merging
>
>                                                         them
>
>                                                                 at the
>                                                                 end
>
> (potentially
>                                                                 even
>                                                                 slower).
>                                                                 How much
>                                                                 faster our
>
>                                 approach
>
>                                         is
>
>                                                                 highly
>                                                                 depends
>                                                                 on the
>                                                                 buffer
>                                                                 sizes
>                                                                 and
>
> expressions,
>                                                                 but we
>
>                             found
>
>                                     it
>
>                                                     to
>
>                                                         be
>
>                                                                 faster
>                                                                 enough
>                                                                 on
>                                                                 average
>                                                                 to
>                                                                 justify
>                                                                 us
>                                                                 extending
>                                                                 the
>
>                             underlying
>
>                                                     layout.
>
>                                                                 With
>                                                                 that
>                                                                 said,
>                                                                 this is
>                                                                 all
>                                                                 within a
>                                                                 single
>                                                                 thread of
>
>                             execution.
>
>
> Parallelization
>                                                                 is done
>                                                                 by
>                                                                 feeding
>                                                                 each
>                                                                 thread
>                                                                 its own
>
>                                     vector/data.
>
>                                                     As
>
>                                                                 pointed
>                                                                 out in a
>                                                                 previous
>                                                                 message,
>                                                                 this
>                                                                 also
>                                                                 gives
>                                                                 you the
>
>                                                     flexibility
>
>                                                         to
>
>                                                                 implement
>                                                                 cardinality
>
> increasing/reducing
>                                                                 operations,
>                                                                 but
>
>                     we
>
>                                 don’t
>
>                                                     use
>
>                                                         it
>
>                                                                 for that
>                                                                 purpose.
>                                                                 Operations
>                                                                 like
>                                                                 filtering,
>                                                                 joining,
>
>                             unnesting
>
>                                     and
>
>                                                         similar
>
>                                                                 are done
>                                                                 by
>                                                                 wrapping
>                                                                 the
>                                                                 internal
>                                                                 vector
>                                                                 in a
>
> dictionary,
>
>                     as
>
>                                 these
>
>                                                     need
>
>                                                         to
>
>                                                                 work not
>                                                                 only on
>                                                                 “ListViews”
>                                                                 but on
>                                                                 any data
>                                                                 types with
>
>                     any
>
>                                         encoding.
>
>                                                         There
>
>                                                                 are more
>                                                                 details
>                                                                 on
>                                                                 Section
>                                                                 4.2.1 in
>                                                                 [1]
>                                                                 Beyond
>                                                                 this, it
>                                                                 also
>                                                                 gives
>
> function/kernel
>                                                                 developers
>                                                                 more
>
>                                                     flexibility
>
>                                                         to
>
>                                                                 implement
>                                                                 operations
>                                                                 that
>                                                                 manipulate
>
> Arrays/Maps.
>                                                                 For
>
>                     example,
>
>                                                         operations
>
>                                                                 that
>                                                                 slice
>                                                                 these
>                                                                 containers
>                                                                 can be
>                                                                 implemented
>                                                                 in a
>
>                     zero-copy
>
>                                     manner
>
>                                                     by
>
>                                                                 just
>                                                                 rearranging
>                                                                 the
>
> lengths/offsets
>                                                                 indices,
>                                                                 without
>                                                                 ever
>
>                                     touching
>
>                                                     the
>
>                                                                 larger
>                                                                 internal
>                                                                 buffers.
>                                                                 This is
>                                                                 a
>                                                                 similar
>                                                                 motivation
>                                                                 as
>
>                     for
>
>                                                     StringView
>
>                                                                 (think
>                                                                 of
>                                                                 substr(),
>                                                                 trim(),
>                                                                 and
>                                                                 similar).
>                                                                 One nice
>                                                                 last
>
>                             property
>
>                                     is
>
>                                                     that
>
>                                                                 this
>                                                                 layout
>                                                                 allows
>                                                                 for
>                                                                 overlapping
>                                                                 ranges.
>                                                                 This is
>
>                     something
>
>                                                     discussed
>
>                                                         with
>
>                                                                 our ML
>                                                                 people
>                                                                 to allow
>                                                                 deduping
>                                                                 feature
>                                                                 values
>                                                                 in a
> tensor
>
>                                 (which
>
>                                     is
>
>                                                         fairly
>
>                                                                 common),
>                                                                 but not
>                                                                 something
>                                                                 we have
>                                                                 leveraged
>                                                                 just
>                                                                 yet. [1]
>                                                                 -
>
> https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
>                                                                 Best, --
>                                                                 Pedro
>                                                                 Pedreira
>
> ------------------------------------------------------------------------
>                                                                 From:
>                                                                 Felipe
>                                                                 Oliveira
>                                                                 Carvalho
>                                                                 <
> felipe...@gmail.com>
>                                                                 Sent:
>                                                                 Friday,
>                                                                 May 19,
>                                                                 2023
>                                                                 10:01 AM
>                                                                 To:
>
> dev@arrow.apache.org
>                                                                 <
> dev@arrow.apache.org>
>                                                                 Cc:
>                                                                 Pedro
>                                                                 Eugenio
>                                                                 Rocha
>                                                                 Pedreira
>                                                                 <
> pedro...@meta.com>
>                                                                 Subject:
>                                                                 Re:
>
> [DISCUSS][Format]
>                                                                 Starting
>                                                                 the draft
>
>                             implementation
>
>                                     of
>
>                                                     the
>
>                                                                 ArrayView
>                                                                 array
>                                                                 format
>                                                                 +pedroerp
>                                                                 On Thu,
>                                                                 11 May
>                                                                 2023 at
>                                                                 17: 51
>                                                                 Raphael
>
>                     Taylor-Davies
>
>                                 <r.
>
>
> taylordavies@
>                                                                 googlemail.
>                                                                 com.
>                                                                 invalid>
>                                                                 wrote:
>                                                                 Hi All, >
>
>                     if
>
>                             we
>
>                                         added
>
>                                                                 this, do
>                                                                 we think
>                                                                 many
>                                                                 Arrow
>                                                                 and
>                                                                 query >
>                                                                 engine
>
>                             implementations
>
>                                     (for
>
>                                                                 example,
>                                                                 DataFusion)
>                                                                 will be
>
> ZjQcmQRYFpfptBannerStart
>                                                                 This
>                                                                 Message
>                                                                 Is From
>                                                                 an
>                                                                 External
>                                                                 Sender
>
> ZjQcmQRYFpfptBannerEnd
>                                                                 +pedroerp
>                                                                 On Thu,
>                                                                 11 May
>                                                                 2023 at
>                                                                 17:51
>                                                                 Raphael
>
> Taylor-Davies
>
> <r.taylordav...@googlemail.com.invalid>
>                                                                 wrote:
>                                                                 Hi All,
>
>                                                                     if
>                                                                     we
>                                                                     added
>                                                                     this,
>                                                                     do
>                                                                     we
>                                                                     think
>                                                                     many
>                                                                     Arrow
>                                                                     and
>                                                                     query
>                                                                     engine
>
> implementations
>                                                                     (for
>
> example,
>
> DataFusion)
>                                                                     will
> be
>
>                             eager
>
>                                 to
>
>                                                     add
>
>                                                                 full
>
>                                                                     support
>                                                                     for
>                                                                     the
>                                                                     type,
>
> including
>                                                                     compute
>
> kernels?
>                                                                     Or are
>
>                     they
>
>                                     likely
>
>                                                     to
>
>                                                                 just
>
>                                                                     convert
>                                                                     this
>                                                                     type
>                                                                     to
>
> ListArray
>                                                                     at
>                                                                     import
>
> boundaries?
>
>
>                                                                 I can't
>                                                                 speak
>                                                                 for
>                                                                 query
>                                                                 engines
>                                                                 in
>                                                                 general,
>                                                                 but at
>                                                                 least
>
>                     for
>
>                                         arrow-rs
>
>                                                                 and by
>                                                                 extension
>                                                                 DataFusion,
>                                                                 and
>                                                                 based on
>                                                                 my current
>
>                                 understanding
>
>                                         of
>
>                                                                 the
>                                                                 use-cases
>                                                                 I would
>                                                                 be
>                                                                 rather
>                                                                 hesitant
>                                                                 to add
>                                                                 support
>
>                     to the
>
>                                                     kernels
>
>                                                                 for this
>                                                                 array
>                                                                 type,
>                                                                 definitely
>                                                                 instead
>                                                                 favouring
>
>                     conversion
>
>                             at
>
>                                     the
>
>                                                                 edges.
>                                                                 We
>                                                                 already
>                                                                 have
>                                                                 issues
>                                                                 with the
>                                                                 amount
>                                                                 of code
>
>                             generation
>
>                                                                 resulting
>                                                                 in
>                                                                 binary
>                                                                 bloat
>                                                                 and long
>                                                                 compile
>                                                                 times,
>                                                                 and I
>
>                     worry
>
>                                 this
>
>                                                     would
>
>                                                                 worsen
>                                                                 this
>                                                                 situation
>                                                                 whilst
>                                                                 not
>                                                                 really
>                                                                 providing
>
>                     compelling
>
>                                                     advantages
>
>                                                                 for the
>                                                                 vast
>                                                                 majority
>                                                                 of
>                                                                 workloads
>                                                                 that
>                                                                 don't
>                                                                 interact
>
>                     with
>
>                                 Velox.
>
>                                                                 Whilst I
>                                                                 can
>                                                                 definitely
>                                                                 see that
>                                                                 the
>                                                                 ListView
>
>                     representation
>
>                             is
>
>                                                     probably
>
>                                                                 a better
>                                                                 way to
>                                                                 represent
>                                                                 variable
>                                                                 length
>                                                                 lists
>                                                                 than what
>
>                             arrow
>
>                                                     settled
>
>                                                                 upon,
>                                                                 I'm not
>                                                                 yet
>                                                                 convinced
>                                                                 it is
>
> sufficiently
>                                                                 better to
>
>                                     incentivise
>
>                                                                 broad
>                                                                 ecosystem
>                                                                 adoption.
>                                                                 Kind
>                                                                 Regards,
>                                                                 Raphael
>
> Taylor-Davies
>
>
>                                                                     On
>
> 11/05/2023
>                                                                     21:20,
>                                                                     Will
>                                                                     Jones
>                                                                     wrote:
>                                                                     Hi
>                                                                     Felipe,
>                                                                     Thanks
>                                                                     for
>                                                                     the
>
> additional
>
> details.
>
>
>
> Velox
>
> kernels
>
> benefit
>
> from
>
> being
>
> able
>                                                                         to
>
> append
>
> data
>                                                                         to
>
>                     the
>
>                                     array
>
>                                                     from
>
>
> different
>
> threads
>
> without
>
> care
>                                                                         for
>
> strict
>
> ordering.
>
>
>                     Only the
>
>                                                     offsets
>
>                                                                 array
>
>                                                                         has
>                                                                         to
>                                                                         be
>
> written
>
> according
>                                                                         to
>
> logical
>
> order
>                                                                         but
>
> that
>                                                                         is
>
>                                                     potentially a
>
>                                                                 much
>
>
> smaller
>
> buffer
>
> than
>                                                                         the
>
> values
>
> buffer.
>
>
>                                                                     It
>                                                                     still
>                                                                     seems
>                                                                     to
>                                                                     me
>                                                                     like
>
> applications
>                                                                     are
>                                                                     still
>                                                                     pretty
>
>                             niche,
>
>                                     as I
>
>                                                                 suspect
>
>                                                                     in
>                                                                     most
>                                                                     cases
>                                                                     the
>
> benefits
>                                                                     are
>
> outweighed
>                                                                     by
>                                                                     the
>                                                                     costs.
>
>                     The
>
>                                     benefit
>
>                                                         here
>
>                                                                     seems
>                                                                     pretty
>
> limited:
>                                                                     if
>                                                                     you
>                                                                     are
>                                                                     trying
>                                                                     to
>                                                                     split
>                                                                     work
>
>                     between
>
>                                                     threads,
>
>                                                                     usually
>                                                                     you
>                                                                     will
>                                                                     have
>                                                                     other
>                                                                     levels
>                                                                     such
>                                                                     as
>                                                                     array
>                                                                     chunks
>
>                     to
>
>                                                     parallelize.
>
>                                                                 And
>
>                                                                     if
>                                                                     you
>                                                                     have
>                                                                     an
>
> incoming
>                                                                     stream
>                                                                     of
>                                                                     row
>                                                                     data,
>                                                                     you'll
>                                                                     want
>
>                     to
>
>                                 append
>
>                                         in
>
>
> predictable
>                                                                     order
>                                                                     to
>                                                                     match
>                                                                     the
>                                                                     order
>                                                                     of
>                                                                     the
>                                                                     other
>
>                     arrays. Am
>
>                             I
>
>                                                     missing
>
>
> something?
>                                                                     And,
>                                                                     IIUC,
>                                                                     the
>                                                                     cost
>                                                                     of
>                                                                     using
>
> ListView
>                                                                     with
>
> out-of-order
>
>
>                             values
>
>                                     over
>
>
> ListArray
>                                                                     is
>                                                                     you
>                                                                     lose
>                                                                     memory
>
> locality;
>                                                                     the
>                                                                     values
>                                                                     of
>
>                     element
>
>                             2
>
>                                     are
>
>                                                     no
>
>                                                                     longer
>
> adjacent
>                                                                     to
>                                                                     the
>                                                                     values
>                                                                     of
>                                                                     element
>                                                                     1.
>                                                                     What
>                                                                     do you
>
>                     think
>
>                                     about
>
>                                                         that
>
>
> tradeoff?
>                                                                     I
>                                                                     don't
>                                                                     mean
>                                                                     to
>                                                                     be
>
> difficult
>                                                                     about
>                                                                     this.
>                                                                     I'm
>                                                                     excited
>                                                                     for
>
>                     both
>
>                                 the
>
>                                                     REE
>
>                                                         and
>
>
> StringView
>                                                                     arrays,
>                                                                     but
>                                                                     this
>                                                                     one
>                                                                     I'm
>                                                                     not
>                                                                     so
>                                                                     sure
>                                                                     about
>
>                     yet. I
>
>                                         suppose
>
>                                                                 what I
>
>                                                                     am
>                                                                     trying
>                                                                     to
>                                                                     ask
>                                                                     is,
>                                                                     if
>                                                                     we
>                                                                     added
>                                                                     this,
>                                                                     do
>                                                                     we
>                                                                     think
>                                                                     many
>
>                     Arrow
>
>                                 and
>
>                                                     query
>
>                                                                     engine
>
> implementations
>                                                                     (for
>
> example,
>
> DataFusion)
>                                                                     will
> be
>
>                             eager
>
>                                 to
>
>                                                     add
>
>                                                                 full
>
>                                                                     support
>                                                                     for
>                                                                     the
>                                                                     type,
>
> including
>                                                                     compute
>
> kernels?
>                                                                     Or are
>
>                     they
>
>                                     likely
>
>                                                     to
>
>                                                                 just
>
>                                                                     convert
>                                                                     this
>                                                                     type
>                                                                     to
>
> ListArray
>                                                                     at
>                                                                     import
>
> boundaries?
>                                                                     Because
>                                                                     if
>                                                                     it
>                                                                     turns
>                                                                     out
>                                                                     to
>                                                                     be
>                                                                     the
>                                                                     latter,
>                                                                     then
>                                                                     we
>                                                                     might
>
>                     as
>
>                             well
>
>                                     ask
>
>                                                         Velox
>
>                                                                     to
>                                                                     export
>                                                                     this
>                                                                     type
>                                                                     as
>
> ListArray
>                                                                     and
>                                                                     save
>                                                                     the
>                                                                     rest
>                                                                     of the
>
>                                     ecosystem
>
>                                                         some
>
>                                                                     work.
>                                                                     Best,
>                                                                     Will
>                                                                     Jones
>                                                                     On
>                                                                     Thu,
>                                                                     May
>                                                                     11,
>                                                                     2023
>                                                                     at
>
> 12:32 PM
>                                                                     Felipe
>
> Oliveira
>
>
>                     Carvalho <
>
>
> felipe...@gmail.com<mailto:felipe...@gmail.com
>
> <mailto:felipe...@gmail.com>>>
>                                                                     wrote:
>
>
> Initial
>
> reason
>                                                                         for
>
> ListView
>
> arrays
>                                                                         in
>
> Arrow
>                                                                         is
>
> zero-copy
>
>
>                                                     compatibility
>
>                                                                 with
>
>
> Velox
>
> which
>
> uses
>
> this
>
> format.
>
> Velox
>
> kernels
>
> benefit
>
> from
>
> being
>
> able
>                                                                         to
>
> append
>
> data
>                                                                         to
>
>                     the
>
>                                     array
>
>                                                     from
>
>
> different
>
> threads
>
> without
>
> care
>                                                                         for
>
> strict
>
> ordering.
>
>
>                     Only the
>
>                                                     offsets
>
>                                                                 array
>
>                                                                         has
>                                                                         to
>                                                                         be
>
> written
>
> according
>                                                                         to
>
> logical
>
> order
>                                                                         but
>
> that
>                                                                         is
>
>                                                     potentially a
>
>                                                                 much
>
>
> smaller
>
> buffer
>
> than
>                                                                         the
>
> values
>
> buffer.
>
> Acero
>
> kernels
>
> could
>
> take
>
> advantage
>                                                                         of
>
> that
>                                                                         in
>
> the
>
>                     future.
>
>                                                                         In
>
> implementing
>
> ListViewArray/Type
>                                                                         I
>                                                                         was
>
> able
>                                                                         to
>
> reuse
>
>
>                     some
>
>                                 C++
>
>                                                                 templates
>
>
> used
>                                                                         for
>
> ListArray
>
> which
>                                                                         can
>
> reduce
>
> some
>                                                                         of
>                                                                         the
>
> burden
>
>
>                     on
>
>                                 kernel
>
>
> implementations
>
> that
>                                                                         aim
>                                                                         to
>
> work
>
> with
>                                                                         all
>                                                                         the
>
> types.
>                                                                         I’m
>                                                                         can
>                                                                         fix
>
> Acero
>
> kernels
>                                                                         for
>
> working
>
> with
>
> ListView.
>
>
>                     This is
>
>                                                     similar
>
>                                                         to
>
>                                                                 the
>
>
> work
>
> I’ve
>
> doing
>                                                                         in
>
> kernels
>
> dealing
>
> with
>
> run-end
>
> encoded
>
>
>                                 arrays.
>
>                                                                         —
>
> Felipe
>                                                                         On
>
> Wed,
>                                                                         26
>                                                                         Apr
>
> 2023
>                                                                         at
>
> 01:03
>
> Will
>
> Jones
>                                                                         <
>
>                                     will.jones...@gmail.com
>
>                                                                 <mailto:
> will.jones...@gmail.com
>                                                                 <mailto:
> will.jones...@gmail.com>>>
>                                                                 wrote:
>
>
>   I
>
>   suppose
>
>   one
>
>   common
>
>   use
>
>   case
>
>   is
>
>   materializing
>
>   list
>
>
>                     columns
>
>                                     after
>
>                                                         some
>
>
>   expanding
>
>   operation
>
>   like
>
>   a
>
>   join
>
>   or
>
>   unnest.
>
>   That's
>
>   a
>
>
>                     case
>
>                                 where
>
>                                     I
>
>                                                         could
>
>
>   imagine
>
>   a
>
>   lot
>
>   of
>
>   repetition
>
>   of
>
>   values.
>
>   Haven't
>
>   yet
>
>
>                     thought
>
>                             of
>
>                                                     common
>
>
> cases
>
>
>
>   where
>
>   there
>
>   is
>
>   overlap
>
>   but
>
>   not
>
>   full
>
>   duplication,
>
>   but
>
>   am
>
>
>                             eager
>
>                                     to
>
>                                                     hear
>
>
> any.
>
>
>
>   The
>
>   dictionary
>
>   encoding
>
>   point
>
>   Raphael
>
>   makes
>
>   is
>
>
>                     interesting,
>
>                                                         especially
>
>
>   given
>
>   the
>
>   existence
>
>   of
>
>   LargeList
>
>   and
>
>   FixedSizeList.
>
>   For
>
>
>                             many
>
>                                                                 operations,
>
>                                                                         it
>
>
>   might
>
>   make
>
>   more
>
>   sense
>
>   to
>
>   just
>
>   compose
>
>   those
>
>   existing
>
>
>                     types.
>
>
>   IIUC
>
>   the
>
>   operations
>
>   that
>
>   would
>
>   be
>
>   unique
>
>   to
>
>   the
>
>
>                     ArrayView
>
>                             are
>
>                                         ones
>
>
> altering
>
>
>
>   the
>
>   shape.
>
>   One
>
>   could
>
>   truncate
>
>   each
>
>   array
>
>   to
>
>   a
>
>   certain
>
>
>                             length
>
>                                                     cheaply
>
>
> simply
>
>
>
>   by
>
>   replacing
>
>   the
>
>   sizes
>
>   buffer.
>
>   Or
>
>   perhaps
>
>   there
>
>   are
>
>
>                                 interesting
>
>
> operations
>
>
>
>   on
>
>   tensors
>
>   that
>
>   would
>
>   benefit.
>
>   On
>
>   Tue,
>
>   Apr
>
>   25,
>
>   2023
>
>   at
>
>   7:47 PM
>
>   Raphael
>
>   Taylor-Davies
>
>   <r.taylordav...@googlemail.com.invalid>
>
>   wrote:
>
>
>
>       Unless
>
>       I
>
>       am
>
>       missing
>
>       something,
>
>       I
>
>       think
>
>       the
>
>       selection
>
>
>                                 use-case
>
>                                                     could
>
>                                                         be
>
>
>       equally
>
>       well
>
>       served
>
>       by
>
>       a
>
>       dictionary-encoded
>
>
>                                         BinarArray/ListArray,
>
>                                                         and
>
>
>   would
>
>
>
>       have
>
>       the
>
>       benefit
>
>       of
>
>       not
>
>       requiring
>
>       any
>
>       modifications
>
>
>                     to the
>
>                                                     existing
>
>
>   format
>
>
>
>       or
>
>       kernels.
>
>       The
>
>       major
>
>       additional
>
>       flexibility
>
>       of
>
>       the
>
>       proposed
>
>
>                     encoding
>
>                                     would
>
>                                                     be
>
>
>       permitting
>
>       disjoint
>
>       or
>
>       overlapping
>
>       ranges,
>
>       are
>
>       these
>
>
>                             common
>
>                                                     enough
>
>                                                         in
>
>
>       practice
>
>       to
>
>       represent
>
>       a
>
>       meaningful
>
>       bottleneck?
>
>       On
>
>       26
>
>       April
>
>       2023
>
>       01:40:14
>
>       BST,
>
>       David
>
>       Li
>
>       <
>
>
>                                 lidav...@apache.org
>
>                                                         <mailto: <mailto:>
>
>
> lidav...@apache.org>>
>                                                                 wrote:
>
>
>           Is
>
>           there
>
>           a
>
>           need
>
>           for
>
>           a
>
>           64-bit
>
>           offsets
>
>           version
>
>           the
>
>
>                     same way
>
>                                 we
>
>                                                     have
>
>                                                                 List
>
>
>       and
>
>       LargeList?
>
>
>
>           And
>
>           just
>
>           to
>
>           be
>
>           clear,
>
>           the
>
>           difference
>
>           with
>
>           List
>
>           is
>
>
>                     that
>
>                             the
>
>                                         lists
>
>                                                                 don't
>
>
>       have
>
>       to
>
>       be
>
>       stored
>
>       in
>
>       their
>
>       logical
>
>       order
>
>       (or
>
>       in
>
>       other
>
>
>                             words,
>
>                                                     offsets
>
>                                                                 do
>
>
>   not
>
>
>
>       have
>
>       to
>
>       be
>
>       nondecreasing
>
>       and
>
>       so
>
>       we
>
>       also
>
>       need
>
>       sizes)?
>
>
>
>           On
>
>           Wed,
>
>           Apr
>
>           26,
>
>           2023,
>
>           at
>
>           09:37,
>
>           Weston
>
>           Pace
>
>           wrote:
>
>
>
>               For
>
>               context,
>
>               there
>
>               was
>
>               some
>
>               discussion
>
>               on
>
>               this
>
>               back
>
>
>                     in
>
>                                 [1].
>
>                                     At
>
>                                                         that
>
>
>       time
>
>
>
>               this
>
>               was
>
>               called
>
>               "sequence
>
>               view"
>
>               but
>
>               I
>
>               do
>
>               not
>
>               like
>
>
>                     that
>
>                                 name.
>
>
> However,
>
>
>
>               array-view
>
>               array
>
>               is
>
>               a
>
>               little
>
>               confusing.
>
>               Given
>
>               this
>
>
>                     is
>
>                                     similar
>
>                                                     to
>
>
> list
>
>
>
>       can
>
>
>
>               we
>
>               go
>
>               with
>
>               list-view
>
>               array?
>
>
>
>                   Thanks
>
>                   for
>
>                   the
>
>                   introduction.
>
>                   I'd
>
>                   be
>
>                   interested
>
>                   to
>
>
>                     hear
>
>                                     about
>
>                                                     the
>
>
>                   applications
>
>                   Velox
>
>                   has
>
>                   found
>
>                   for
>
>                   these
>
>                   vectors,
>
>
>                     and in
>
>                                 what
>
>
>   situations
>
>
>
>               they
>
>
>
>                   are
>
>                   useful.
>
>                   This
>
>                   could
>
>                   be
>
>                   contrasted
>
>                   with
>
>                   the
>
>
>                     current
>
>                                                     ListArray
>
>
>                   implementations.
>
>
>
>               I
>
>               believe
>
>               one
>
>               significant
>
>               benefit
>
>               is
>
>               that
>
>               take
>
>               (and
>
>
>                     by
>
>                                     proxy,
>
>
> filter)
>
>
>
>       and
>
>
>
>               sort
>
>               are
>
>               O(#
>
>               of
>
>               items)
>
>               with
>
>               the
>
>               proposed
>
>               format
>
>               and
>
>
>                     O(#
>
>                             of
>
>                                                     bytes)
>
>
> with
>
>
>
>       the
>
>
>
>               current
>
>               format.
>
>               Jorge
>
>               did
>
>               some
>
>               profiling
>
>               to
>
>               this
>
>
>                     effect
>
>                             in
>
>                                         [1].
>
>
>               [1]
>
>
>
> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq<
>
>
>
> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq>
>
>
>
>               On
>
>               Tue,
>
>               Apr
>
>               25,
>
>               2023
>
>               at
>
>               3:13 PM
>
>               Will
>
>               Jones
>
>               <
>
>
>
> will.jones...@gmail.com
>
>
>                                                                 <mailto:
> will.jones...@gmail.com
>                                                                 <mailto:
> will.jones...@gmail.com>>
>
>
>
>       wrote:
>
>
>
>                   Hi
>
>                   Felipe,
>
>                   Thanks
>
>                   for
>
>                   the
>
>                   introduction.
>
>                   I'd
>
>                   be
>
>                   interested
>
>                   to
>
>
>                     hear
>
>                                     about
>
>                                                     the
>
>
>                   applications
>
>                   Velox
>
>                   has
>
>                   found
>
>                   for
>
>                   these
>
>                   vectors,
>
>
>                     and in
>
>                                 what
>
>
>   situations
>
>
>
>       they
>
>
>
>                   are
>
>                   useful.
>
>                   This
>
>                   could
>
>                   be
>
>                   contrasted
>
>                   with
>
>                   the
>
>
>                     current
>
>                                                     ListArray
>
>
>                   implementations.
>
>                   IIUC
>
>                   it
>
>                   would
>
>                   be
>
>                   fairly
>
>                   cheap
>
>                   to
>
>                   transform
>
>                   a
>
>
>                     ListArray
>
>                             to
>
>                                     an
>
>
>       ArrayView,
>
>       but
>
>
>
>                   expensive
>
>                   to
>
>                   go
>
>                   the
>
>                   other
>
>                   way.
>
>                   Best,
>
>                   Will
>
>                   Jones
>
>                   On
>
>                   Tue,
>
>                   Apr
>
>                   25,
>
>                   2023
>
>                   at
>
>                   3:00 PM
>
>                   Felipe
>
>                   Oliveira
>
>
>                             Carvalho
>
>                                 <
>
>
>                   felipe...@gmail.com<mailto:felipe...@gmail.com
>
>                   <mailto:felipe...@gmail.com>>>
>
>
>                             wrote:
>
>
>                       Hi
>
>                       folks,
>
>                       I
>
>                       would
>
>                       like
>
>                       to
>
>                       start
>
>                       a
>
>                       public
>
>                       discussion
>
>                       on
>
>                       the
>
>
>                                 inclusion
>
>                                                     of a
>
>                                                                         new
>
>
>       array
>
>
>
>                       format
>
>                       to
>
>                       Arrow
>
>                       —
>
>                       array-view
>
>                       array.
>
>                       The
>
>                       name
>
>                       is
>
>
>                     also
>
>                             up
>
>                                     for
>
>
> debate.
>
>
>
>                       This
>
>                       format
>
>                       is
>
>                       inspired
>
>                       by
>
>                       Velox's
>
>                       ArrayVector
>
>
>                     format
>
>                                 [1].
>
>
>   Logically,
>
>
>
>                   this
>
>
>
>                       array
>
>                       represents
>
>                       an
>
>                       array
>
>                       of
>
>                       arrays.
>
>                       Each
>
>                       element
>
>
>                     is
>
>                             an
>
>
> array-view
>
>
>
>                   (offset
>
>
>
>                       and
>
>                       size
>
>                       pair)
>
>                       that
>
>                       points
>
>                       to
>
>                       a
>
>                       range
>
>                       within
>
>                       a
>
>
>                     nested
>
>                                                     "values"
>
>
>   array
>
>
>
>                       (called
>
>                       "elements"
>
>                       in
>
>                       Velox
>
>                       docs).
>
>                       The
>
>                       nested
>
>
>                     array
>
>                             can
>
>                                 be
>
>                                         of
>
>                                                         any
>
>
>       type,
>
>
>
>                       which
>
>                       makes
>
>                       this
>
>                       format
>
>                       very
>
>                       flexible
>
>                       and
>
>
>                     powerful.
>
>
>                       [image:
>
>                       ../_images/array-vector.png]
>
>                       <
>
>
>
> https://facebookincubator.github.io/velox/_images/array-vector.png
>
>
>                                 <
>
>         https://facebookincubator.github.io/velox/_images/array-vector.png
>
>
>                       I'm
>
>                       currently
>
>                       working
>
>                       on
>
>                       a
>
>                       C++
>
>                       implementation
>
>                       and
>
>
>                     plan
>
>                                 to
>
>                                                     work
>
>                                                                         on
>                                                                         a
>
>
>       Go
>
>
>
>                       implementation
>
>                       to
>
>                       fulfill
>
>                       the
>
>                       two-implementations
>
>
>                                         requirement
>
>                                                         for
>
>
>       format
>
>
>
>                       changes.
>
>                       The
>
>                       draft
>
>                       design:
>
>                       -
>
>                       3
>
>                       buffers:
>
>                       [validity_bitmap,
>
>                       int32
>
>                       offsets
>
>
>                     buffer,
>
>                                 int32
>
>                                                     sizes
>
>
>       buffer]
>
>
>
>                       -
>
>                       1
>
>                       child
>
>                       array:
>
>                       "values"
>
>                       as
>
>                       an
>
>                       array
>
>                       of
>
>                       the
>
>                       type
>
>
>                                     parameter
>
>
>                       validity_bitmap
>
>                       is
>
>                       used
>
>                       to
>
>                       differentiate
>
>                       between
>
>
>                     empty
>
>                                     array
>
>
> views
>
>
>
>                       (sizes[i]
>
>                       ==
>
>                       0)
>
>                       and
>
>                       NULL
>
>                       array
>
>                       views
>
>
>                             (validity_bitmap[i]
>
>                                     ==
>
>                                                     0).
>
>
>                       When
>
>                       the
>
>                       validity_bitmap[i]
>
>                       is
>
>                       0,
>
>                       both
>
>                       sizes
>
>                       and
>
>
>                             offsets
>
>                                     are
>
>
>       undefined
>
>
>
>                   (as
>
>
>
>                       usual),
>
>                       and
>
>                       when
>
>                       sizes[i]
>
>                       ==
>
>                       0,
>
>                       offsets[i]
>
>                       is
>
>
>                                 undefined. 0
>
>                                         is
>
>
>       recommended
>
>
>
>                       if
>
>                       setting
>
>                       a
>
>                       value
>
>                       is
>
>                       not
>
>                       an
>
>                       issue
>
>                       to
>
>                       the
>
>                       system
>
>
>                                 producing
>
>                                                     the
>
>
>       arrays.
>
>
>
>                       offsets
>
>                       buffer
>
>                       is
>
>                       not
>
>                       required
>
>                       to
>
>                       be
>
>                       ordered
>
>                       and
>
>
>                     views
>
>                                     don't
>
>                                                         have
>
>
>   to
>
>
>
>       be
>
>
>
>                       disjoint.
>
>                       [1]
>
>
>
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
>
>
>                                                                 <
>
>
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
>
>
>
>                       Thanks,
>
>                       Felipe
>
>                       O.
>
>                       Carvalho
>
>

Reply via email to