General approach to alternative formats aside, in the specific case of ListView, I think the implementation complexity is being overestimated in these discussions.
The C++ Arrow implementation shares a lot of code between List and LargeList. And with some tweaks, I'm able to share that common infrastructure for ListView as well. [1] ListView is similar to list: it doesn't require offsets to be sorted and adds an extra buffer containing sizes. For symmetry with the List and LargeList types (FixedSizeList not included), I'm going to propose we add a LargeListView. That is not part of the draft implementation yet, but seems like an obvious thing to have now that I implemented the `if_else` specialization. [2] David Li asked about this above and I can confirm now that 64-bit version of ListView (LargeListView) is in the plans. Trying to avoid re-implementing some kernels is not a good goal to chase, IMO, because kernels need tweaks to take advantage of the format. [1] https://github.com/apache/arrow/pull/35345 [2] https://github.com/felipecrv/arrow/commits/list_view_backup -- Felipe On Wed, Jun 14, 2023 at 12:08 PM Weston Pace <weston.p...@gmail.com> wrote: > > perhaps we could support this use-case as > > a canonical extension type over dictionary encoded, variable-sized > > arrays > > I believe this suggestion is valid and could be used to solve the if-else > case. The algorithm, if I understand it, would be roughly: > > ``` > // Note: Simple pseudocode, vectorization left as exercise for the reader > auto indices_builder = ... > auto list_builder = ... > indices_builder.resize(batch.length); > Array condition_mask = condition.EvaluateBatch(batch); > for row_index in selected_rows(condition_mask): > indices_builder[row_index] = list_builder.CurrentLength(); > list_builder.Append(if_expr.EvaluateRow(batch, row_index)) > for row_index in unselected_rows(condition_mask): > indices_builder[row_index] = list_builder.CurrentLength(); > list_builder.Append(else_expr.EvaluateRow(batch, row_index)) > return DictionaryArray(indices_builder.Finish(), list_builder.Finish()) > ``` > > I also agree this is a slightly awkward use of dictionaries (e.g. the > dictionary would have the same size as the # of indices) and perhaps not > the most intuitive way to solve the problem. > > My gut reaction is simply "an improved if/else kernel is not, alone, enough > justification for a new layout" and yet... > > I think we are seeing the start (I hope) of a trend where Arrow is not just > used "between systems" (e.g. to shuttle data from one place to another, or > between a query engine and a visualization tool) but also "within systems" > (e.g. UDFs, bespoke file formats and temporary tables, between workers in a > distributed query engine). When arrow is used "within systems" I think > both the number of bespoke formats and the significance of conversion cost > increases. For example, it's easy to say that Velox should convert at the > boundary as data leaves Velox. But what if Velox (or datafusion or ...) > were to define an interface for UDFs. Would we want to use Arrow there > (e.g. the C data interface is a good fit)? If so, wouldn't the conversion > cost be more significant? > > > Also, I'm very lukewarm towards the concept of "alternative layouts" > > suggested somewhere else in this thread. It does not seem a good choice > > to complexify the Arrow format that much. > > I think, in my opinion, this depends on how many of these alternative > layouts exist. If there are just a few, then I agree, we should just adopt > them as formal first-class layouts. If there are many, then I think it > will be too much complexity in Arrow to have all the different choices. > Or, we could say there are many, but the alternatives don't belong in Arrow > at all. In that case I think it's the same question as the above > paragraph, "do we want Arrow to be used within systems? Or just between > systems?" > > > On Wed, Jun 14, 2023 at 2:07 AM Antoine Pitrou <anto...@python.org> wrote: > > > > > I agree that ListView cannot be an extension type, given that it > > features a new layout, and therefore cannot reasonably be backed by an > > existing storage type (AFAICT). > > > > Also, I'm very lukewarm towards the concept of "alternative layouts" > > suggested somewhere else in this thread. It does not seem a good choice > > to complexify the Arrow format that much. > > > > Regards > > > > Antoine. > > > > > > Le 07/06/2023 à 00:21, Felipe Oliveira Carvalho a écrit : > > > +1 on what Ian said. > > > > > > And as I write kernels for this new format, I’m learning that it’s > > possible > > > to re-use the common infrastructure used by List and LargeList to > > implement > > > the ListView related features with some adjustments. > > > > > > IMO having this format as a second-class citizen would more likely > > > complicate things because it would make this unification harder. > > > > > > — > > > Felipe > > > > > > On Tue, 6 Jun 2023 at 18:45 Ian Cook <ianmc...@apache.org> wrote: > > > > > >> To clarify why we cannot simply propose adding ListView as a new > > >> “canonical extension type”: The extension type mechanism in Arrow > > >> depends on the underlying data being organized in an existing Arrow > > >> layout—that way an implementation that does not support the extension > > >> type can still handle the underlying data. But ListView is a wholly > > >> new layout. > > >> > > >> I strongly agree with Weston’s idea that it is a good time for Arrow > > >> to introduce the notion of “canonical alternative layouts.” > > >> > > >> Taken together, I think that canonical extension types and canonical > > >> alternative layouts could serve as an “incubator” for proposed new > > >> representations. For example, if a proposed canonical alternative > > >> layout ends up being broadly adopted, then that will serve as a signal > > >> that we should consider adding it as a primary layout in the core > > >> spec. > > >> > > >> It seems to me that most projects that are implementing Arrow today > > >> are not aiming to provide complete coverage of Arrow; rather they are > > >> adopting Arrow because of its role as a standard and they are > > >> implementing only as much of the Arrow standard as they require to > > >> achieve some goal. I believe that such projects are important Arrow > > >> stakeholders, and I believe that this proposed notion of canonical > > >> alternative layouts will serve them well and will create efficiencies > > >> by standardizing implementations around a shared set of alternatives. > > >> > > >> However I think that the documentation for canonical alternative > > >> layouts should strongly encourage implementers to default to using the > > >> primary layouts defined in the core spec and only use alternative > > >> layouts in cases where the primary layouts do not meet their needs. > > >> > > >> > > >> On Sat, May 27, 2023 at 7:44 PM Micah Kornfield < > emkornfi...@gmail.com> > > >> wrote: > > >>> > > >>> This sounds reasonable to me but my main concern is, I'm not sure > there > > >> is > > >>> a great mechanism to enforce canonical layouts don't somehow become > > >> default > > >>> (or the only implementation). > > >>> > > >>> Even for these new layouts, I think it might be worth rethinking > > binding > > >> a > > >>> layout into the schema versus having a different concept of encoding > > (and > > >>> changing some of the corresponding data structures). > > >>> > > >>> > > >>> On Mon, May 22, 2023 at 10:37 AM Weston Pace <weston.p...@gmail.com> > > >> wrote: > > >>> > > >>>> Trying to settle on one option is a fruitless endeavor. Each type > has > > >> pros > > >>>> and cons. I would also predict that the largest existing usage of > > >> Arrow is > > >>>> shuttling data from one system to another. The newly proposed > format > > >>>> doesn't appear to have any significant advantage for that use case > (if > > >>>> anything, the existing format is arguably better as it is more > > >> compact). > > >>>> > > >>>> I am very biased towards historical precedent and avoiding breaking > > >>>> changes. > > >>>> > > >>>> We have "canonical extension types", perhaps it is time for > "canonical > > >>>> alternative layouts". We could define it as such: > > >>>> > > >>>> * There are one or more primary layouts > > >>>> * Existing layouts are automatically considered primary layouts, > > >> even if > > >>>> they wouldn't > > >>>> have been primary layouts initially (e.g. large list) > > >>>> * A new layout, if it is semantically equivalent to another, is > > >> considered > > >>>> an alternative layout > > >>>> * An alternative layout still has the same requirements for > adoption > > >> (two > > >>>> implementations and a vote) > > >>>> * An implementation should not feel pressured to rush and > > implement > > >> the > > >>>> new layout. > > >>>> It would be good if they contribute in the discussion and > > >> consider the > > >>>> layout and vote > > >>>> if they feel it would be an acceptable design. > > >>>> * We can define and vote and approve as many canonical alternative > > >> layouts > > >>>> as we want: > > >>>> * A canonical alternative layout should, at a minimum, have some > > >>>> reasonable justification, such as improved performance for > > >> algorithm X > > >>>> * Arrow implementations MUST support the primary layouts > > >>>> * An Arrow implementation MAY support a canonical alternative, > > >> however: > > >>>> * An Arrow implementation MUST first support the primary layout > > >>>> * An Arrow implementation MUST support conversion to/from the > > >> primary > > >>>> and canonical layout > > >>>> * An Arrow implementation's APIs MUST only provide data in the > > >>>> alternative > > >>>> layout if it is explicitly asked for (e.g. schema inference > > should > > >>>> prefer the primary layout). > > >>>> * We can still vote for new primary layouts (e.g. promoting a > > >> canonical > > >>>> alternative) but, in these > > >>>> votes we don't only consider the value (e.g. performance) of > the > > >> layout > > >>>> but also the interoperability. > > >>>> In other words, a layout can only become a primary layout if > > there > > >> is > > >>>> significant evidence that most > > >>>> implementations plan to adopt it. > > >>>> > > >>>> This lets us evolve support for new layouts more naturally. We can > > >>>> generally assume that users will not, initially, be aware of these > > >>>> alternative layouts. However, everything will just work. They may > > >> start > > >>>> to see a performance penalty stemming from a lack of support for > these > > >>>> layouts. If this performance penalty becomes significant then they > > >> will > > >>>> discover it and become aware of the problem. They can then ask > > >> whatever > > >>>> library they are using to add support for the alternative layout. > As > > >>>> enough users find a need for it then libraries will add support. > > >>>> Eventually, enough libraries will support it that we can adopt it > as a > > >>>> primary layout. > > >>>> > > >>>> Also, it allows libraries to adopt alternative layouts more > > >> aggressively if > > >>>> they would like while still hopefully ensuring that we eventually > all > > >>>> converge on the same implementation of the alternative layout. > > >>>> > > >>>> On Mon, May 22, 2023 at 9:35 AM Will Jones <will.jones...@gmail.com > > > > >>>> wrote: > > >>>> > > >>>>> Hello Arrow devs, > > >>>>> > > >>>>> I don't understand why we would start deprecating features in the > > >> Arrow > > >>>>>> format. Even starting this talk might already be a bad idea > > >> PR-wise. > > >>>>>> > > >>>>> > > >>>>> I agree we don't want to make breaking changes to the Arrow format. > > >> But > > >>>>> several maintainers have already stated they have no interest in > > >>>>> maintaining both list types with full compute functionality [1][2], > > >> so I > > >>>>> think it's very likely one list type or the other will be > > >>>>> implicitly preferred in the ecosystem if this data type was added. > > >> If > > >>>>> that's the case, I'd prefer that we agreed as a community which one > > >>>> should > > >>>>> be preferred. Maybe that's not the best path; it's just one way for > > >> us to > > >>>>> balance stability, maintenance burden, and relevance. > > >>>>> > > >>>>> Can someone help distill down the primary rationale and usecase for > > >>>>>> adding ArrayView to the Arrow Spec? > > >>>>>> > > >>>>> > > >>>>> Looking back at that old thread, I think one of the main > motivations > > >> is > > >>>> to > > >>>>> try to prevent query engine implementers from feeling there is a > > >> tradeoff > > >>>>> between having state-of-the-art performance and being Arrow-native. > > >> For > > >>>>> some of the new array types, we had both Velox and DuckDB use them, > > >> so it > > >>>>> was reasonable to expect they were innovations that might > > >> proliferate. > > >>>> I'm > > >>>>> not sure if the ArrayView is part of that. From Wes earlier [3]: > > >>>>> > > >>>>> The idea is that in a world of data and query federation (for > > >> example, > > >>>>>> consider [1] where Arrow is being used as a data federation layer > > >> with > > >>>>> many > > >>>>>> query engines), we want to increase the amount of data in-flight > > >> and > > >>>>>> in-memory that is in Arrow format. So if query engines are having > > >> to > > >>>>> depart > > >>>>>> substantially from the Arrow format to get performance, then this > > >>>>> creates a > > >>>>>> potential lose-lose situation: * Depart from Arrow: get better > > >>>>> performance > > >>>>>> but pay serialization costs to read and write Arrow (the > > >> performance > > >>>> and > > >>>>>> resource utilization benefits outweigh the serialization costs). > > >> This > > >>>>> puts > > >>>>>> additional pressure on query engines to build specialized > > >> components > > >>>> for > > >>>>>> solving problems rather than making use of off-the-shelf > components > > >>>> that > > >>>>>> use Arrow. This has knock-on effects on ecosystem fragmentation. * > > >> Or > > >>>> use > > >>>>>> Arrow, and accept suboptimal query processing performance > > >>>>>> > > >>>>> > > >>>>> > > >>>>> Will mentions one usecase is Velox consuming python UDF output, > which > > >>>> seems > > >>>>>> to be mostly about how fast Velox can consume this format, not how > > >> fast > > >>>>> it > > >>>>>> can be written. Are there other usecases? > > >>>>>> > > >>>>> > > >>>>> To be clear, I don't know if that's the use case they want. That's > > >> just > > >>>> me > > >>>>> speculating. > > >>>>> > > >>>>> I still have some questions myself: > > >>>>> > > >>>>> 1. Is this array type currently only used in Velox? (not DuckDB > like > > >> some > > >>>>> of the other new types?) What evidence do we have that it will > become > > >>>> used > > >>>>> outside of Velox? > > >>>>> 2. We already have three list types: list, large list (64-bit > > >> offsets), > > >>>> and > > >>>>> fixed size list. Do we think we will only want a view version of > the > > >>>> 32-bit > > >>>>> offset variable length list? Or are we potentially talking about > view > > >>>>> variants for all three? > > >>>>> > > >>>>> Best, > > >>>>> > > >>>>> Will Jones > > >>>>> > > >>>>> [1] > https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx > > >>>>> [2] > https://lists.apache.org/thread/cc4w3vs3foj1fmpq9x888k51so60ftr3 > > >>>>> [3] > https://lists.apache.org/thread/mk2yn62y6l8qtngcs1vg2qtwlxzbrt8t > > >>>>> > > >>>>> On Mon, May 22, 2023 at 3:48 AM Andrew Lamb <al...@influxdata.com> > > >>>> wrote: > > >>>>> > > >>>>>> Can someone help distill down the primary rationale and usecase > for > > >>>>>> adding ArrayView to the Arrow Spec? > > >>>>>> > > >>>>>> From the above discussions, the stated rationale seems to be fast > > >>>>>> (zero-copy) interchange with Velox. > > >>>>>> > > >>>>>> This thread has qualitatively enumerated the benefits of > > >> (offset+len) > > >>>>>> encoding over the existing Arrow ListArray (offets) approach, but > I > > >>>>> haven't > > >>>>>> seen any performance measurements that might help us to gauge the > > >>>>> tradeoff > > >>>>>> in additional complexity vs runtime overhead. > > >>>>>> > > >>>>>> Will mentions one usecase is Velox consuming python UDF output, > > >> which > > >>>>> seems > > >>>>>> to be mostly about how fast Velox can consume this format, not how > > >> fast > > >>>>> it > > >>>>>> can be written. Are there other usecases? > > >>>>>> > > >>>>>> Do we have numbers showing how much overhead converting to /from > > >>>> Velox's > > >>>>>> internal representation and the existing ListArray adds? Has > > >> anyone in > > >>>>>> Velox land considered adding faster support for Arrow style > > >> ListArray > > >>>>>> encoding? > > >>>>>> > > >>>>>> > > >>>>>> Andrew > > >>>>>> > > >>>>>> On Mon, May 22, 2023 at 4:38 AM Antoine Pitrou < > anto...@python.org > > >>> > > >>>>> wrote: > > >>>>>> > > >>>>>>> > > >>>>>>> Hi, > > >>>>>>> > > >>>>>>> I don't understand why we would start deprecating features in the > > >>>> Arrow > > >>>>>>> format. Even starting this talk might already be a bad idea > > >> PR-wise. > > >>>>>>> > > >>>>>>> As for implementing conversions at the I/O boundary, it's a > > >>>> reasonably > > >>>>>>> policy, but it still requires work by implementors and it's not > > >>>> granted > > >>>>>>> that all consumers of the Arrow format will grow such conversions > > >>>>>>> if/when we add non-trivial types such as ListView or StringView. > > >>>>>>> > > >>>>>>> Regards > > >>>>>>> > > >>>>>>> Antoine. > > >>>>>>> > > >>>>>>> > > >>>>>>> Le 22/05/2023 à 00:39, Will Jones a écrit : > > >>>>>>>> One more thing: Looking back on the previous discussion[1] > > >> (which > > >>>>>> Weston > > >>>>>>>> pointed out in his earlier message), Jorge suggested that the > > >> old > > >>>>> list > > >>>>>>>> types might be deprecated in favor of view variants [2]. Others > > >>>> were > > >>>>>>>> worried that it might undermine the perception that the Arrow > > >>>> format > > >>>>> is > > >>>>>>>> stable. I think it might be worth thinking about "soft > > >> deprecating" > > >>>>> the > > >>>>>>> old > > >>>>>>>> list type: suggesting new implementations prefer the list > > >> view, but > > >>>>>>>> reassuring that implementations should support the old format, > > >> even > > >>>>> if > > >>>>>>> they > > >>>>>>>> just convert to the new format. To be clear, this wouldn't > > >> mean we > > >>>>> plan > > >>>>>>> to > > >>>>>>>> create breaking changes in the format; but if we ever did for > > >> other > > >>>>>>>> reasons, the old list type might go. > > >>>>>>>> > > >>>>>>>> Arrow compute libraries could choose either format for compute > > >>>>> support, > > >>>>>>> and > > >>>>>>>> plan to do conversion at the boundaries. Libraries that use > > >> the new > > >>>>>> type > > >>>>>>>> will have cheap conversion when reading the old type. Meanwhile > > >>>> those > > >>>>>>> that > > >>>>>>>> are still on the old type will have some incentive to move > > >> towards > > >>>>> the > > >>>>>>> new > > >>>>>>>> one, since that conversion will not be as efficient. > > >>>>>>>> > > >>>>>>>> [1] > > >>>> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq > > >>>>>>>> [2] > > >>>> https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx > > >>>>>>>> > > >>>>>>>> On Sun, May 21, 2023 at 3:07 PM Will Jones < > > >>>> will.jones...@gmail.com> > > >>>>>>> wrote: > > >>>>>>>> > > >>>>>>>>> Hello, > > >>>>>>>>> > > >>>>>>>>> I think Sasha brings up a good point, that the advantages of > > >> this > > >>>>>> format > > >>>>>>>>> seem to be primarily about query processing. Other encodings > > >> like > > >>>>> REE > > >>>>>>> and > > >>>>>>>>> dictionary have space-saving advantages that justify them > > >> simply > > >>>> in > > >>>>>>> terms > > >>>>>>>>> of space efficiency (although they have query processing > > >>>> advantages > > >>>>> as > > >>>>>>>>> well). As discussed, most use cases are already well served by > > >>>>>> existing > > >>>>>>>>> list types and dictionary encoding. > > >>>>>>>>> > > >>>>>>>>> I agree that there are cases where transferring this type > > >> without > > >>>>>>>>> conversion would be ideal. One use case I can think of is if > > >> Velox > > >>>>>>> wants to > > >>>>>>>>> be able to take Arrow-based UDFs (possibly written with > > >> PyArrow, > > >>>> for > > >>>>>>>>> example) that operate on this column type and therefore wants > > >>>>>> zero-copy > > >>>>>>>>> exchange over the C Data Interface. > > >>>>>>>>> > > >>>>>>>>> One big question I have: we already have three list types: > > >> list, > > >>>>> large > > >>>>>>>>> list (64-bit offsets), and fixed size list. Do we think we > > >> will > > >>>> only > > >>>>>>> want a > > >>>>>>>>> view version of the 32-bit offset variable length list? Or > > >> are we > > >>>>>>>>> potentially talking about view variants for all three? > > >>>>>>>>> > > >>>>>>>>> Best, > > >>>>>>>>> > > >>>>>>>>> Will Jones > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> On Sun, May 21, 2023 at 2:19 PM Felipe Oliveira Carvalho < > > >>>>>>>>> felipe...@gmail.com> wrote: > > >>>>>>>>> > > >>>>>>>>>> The benefit of having a memory format that’s friendly to > > >>>>>>> non-deterministic > > >>>>>>>>>> order writes is unlocked by the transport and processing of > > >> the > > >>>>> data > > >>>>>>> being > > >>>>>>>>>> agnostic to the physical order as much as possible. > > >>>>>>>>>> > > >>>>>>>>>> Requiring a conversion could cancel out that benefit. But it > > >> can > > >>>>> be a > > >>>>>>>>>> provisory step for compatibility between systems that don’t > > >>>>>> understand > > >>>>>>> the > > >>>>>>>>>> format yet. This is similar to the situation with compression > > >>>>> schemes > > >>>>>>> like > > >>>>>>>>>> run-end encoding — the goal is processing the compressed data > > >>>>>> directly > > >>>>>>>>>> without an expansion step whenever possible. > > >>>>>>>>>> > > >>>>>>>>>> This is why having it as part of the open Arrow format is so > > >>>>>> important: > > >>>>>>>>>> everyone can agree on a format that’s friendly to parallel > > >> and/or > > >>>>>>>>>> vectorized compute kernels without introducing multiple > > >>>>> incompatible > > >>>>>>>>>> formats to the ecosystem and without imposing a conversion > > >> step > > >>>>>> between > > >>>>>>>>>> the > > >>>>>>>>>> different systems. > > >>>>>>>>>> > > >>>>>>>>>> — > > >>>>>>>>>> Felipe > > >>>>>>>>>> > > >>>>>>>>>> On Sat, 20 May 2023 at 20:04 Aldrin > > >> <octalene....@pm.me.invalid> > > >>>>>>> wrote: > > >>>>>>>>>> > > >>>>>>>>>>> I don't feel like this representation is necessarily a > > >> detail of > > >>>>> the > > >>>>>>>>>> query > > >>>>>>>>>>> engine, but I am also not sure why this representation would > > >>>> have > > >>>>> to > > >>>>>>> be > > >>>>>>>>>>> converted to a non-view format when serializing. Could you > > >>>> clarify > > >>>>>>>>>> that? My > > >>>>>>>>>>> impression is that this representation could be used for > > >>>>> persistence > > >>>>>>> or > > >>>>>>>>>>> data transfer, though it can be more complex to guarantee > > >> the > > >>>>>> portion > > >>>>>>> of > > >>>>>>>>>>> the buffer that an index points to is also present in > > >> memory. > > >>>>>>>>>>> > > >>>>>>>>>>> Sent from Proton Mail for iOS > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky < > > >>>>>>>>>> krassovskysa...@gmail.com > > >>>>>>>>>>> > > >> <On+Sat,+May+20,+2023+at+15:00,+Sasha+Krassovsky+%3C%3Ca+href=>> > > >>>>>>> wrote: > > >>>>>>>>>>> > > >>>>>>>>>>> Hi everyone, > > >>>>>>>>>>> I understand that there are numerous benefits to this > > >>>>> representation > > >>>>>>>>>>> during query processing, but would it be fair to say that > > >> this > > >>>> is > > >>>>> an > > >>>>>>>>>>> implementation detail of the query engine? Query engines > > >> don’t > > >>>>>>>>>> necessarily > > >>>>>>>>>>> need to conform to the Arrow format internally, only at > > >>>>>> ingest/egress > > >>>>>>>>>>> points, and performing a conversion from the non-view to > > >> view > > >>>>> format > > >>>>>>>>>> seems > > >>>>>>>>>>> like it would be very cheap (though I understand not > > >> necessarily > > >>>>> the > > >>>>>>>>>> other > > >>>>>>>>>>> way around, but you’d need to do that anyway if you’re > > >>>>> serializing). > > >>>>>>>>>>> > > >>>>>>>>>>> Sasha Krassovsky > > >>>>>>>>>>> > > >>>>>>>>>>>> 20 мая 2023 г., в 13:00, Will Jones < > > >> will.jones...@gmail.com> > > >>>>>>>>>>> написал(а): > > >>>>>>>>>>>> > > >>>>>>>>>>>> Thanks for sharing these details, Pedro. The conditional > > >>>>> branches > > >>>>>>>>>>> argument > > >>>>>>>>>>>> makes a lot of sense to me. > > >>>>>>>>>>>> > > >>>>>>>>>>>> The tensors point brings up some interesting issues. For > > >> now, > > >>>>> we've > > >>>>>>>>>>> defined > > >>>>>>>>>>>> our only tensor extension type to be built on a fixed size > > >>>> list. > > >>>>>> If a > > >>>>>>>>>> use > > >>>>>>>>>>>> case of this might be manipulating tensors with zero copy, > > >>>>> perhaps > > >>>>>>>>>> that > > >>>>>>>>>>>> suggests that we want a fixed size list variant? In > > >> addition, > > >>>>> would > > >>>>>>> we > > >>>>>>>>>>> have > > >>>>>>>>>>>> to define another extension type to be a ListView variant? > > >> Or > > >>>>> would > > >>>>>>> we > > >>>>>>>>>>> want > > >>>>>>>>>>>> to think about making extension types somehow valid across > > >>>>> various > > >>>>>>>>>>>> encodings of the same "logical type"? > > >>>>>>>>>>>> > > >>>>>>>>>>>>> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha > > >> Pedreira > > >>>>>>>>>>>>> <pedro...@meta.com.invalid> wrote: > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Hi all, > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> This is Pedro from the Velox team at Meta. This is my > > >> first > > >>>> time > > >>>>>>>>>> here, > > >>>>>>>>>>> so > > >>>>>>>>>>>>> nice to e-meet you! > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Adding to what Felipe said, the main reason we created > > >>>>> “ListView” > > >>>>>>>>>>> (though > > >>>>>>>>>>>>> we just call them ArrayVector/MapVector in Velox) is that, > > >>>> along > > >>>>>>> with > > >>>>>>>>>>>>> StringViews for strings, they allow us to write any > > >> columnar > > >>>>>> buffer > > >>>>>>>>>>>>> out-or-order, regardless of their types or encodings. > > >> This is > > >>>>>>>>>> naturally > > >>>>>>>>>>>>> doable for all primitive types (fixed-size), but not for > > >> types > > >>>>>> that > > >>>>>>>>>>> don’t > > >>>>>>>>>>>>> have fixed size and are required to be contiguous. The > > >>>>> StringView > > >>>>>>> and > > >>>>>>>>>>>>> ListView formats allow us to keep this invariant in Velox. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Being able to write vectors out-of-order is useful when > > >>>>> executing > > >>>>>>>>>>>>> conditionals like IF/SWITCH statements, which are > > >> pervasive > > >>>>> among > > >>>>>>> our > > >>>>>>>>>>>>> workloads. To fully vectorize it, one first evaluates the > > >>>>>>> expression, > > >>>>>>>>>>> then > > >>>>>>>>>>>>> generate a bitmap containing which rows take the THEN and > > >>>> which > > >>>>>> take > > >>>>>>>>>> the > > >>>>>>>>>>>>> ELSE branch. Then you populate all rows that match the > > >> first > > >>>>>> branch > > >>>>>>>>>> by > > >>>>>>>>>>>>> evaluating the THEN expression in a vectorized > > >> (branch-less > > >>>> and > > >>>>>>> cache > > >>>>>>>>>>>>> friendly) way, and subsequently the ELSE branch. If you > > >> can’t > > >>>>>> write > > >>>>>>>>>> them > > >>>>>>>>>>>>> out-of-order, you would either have a big branch per row > > >>>>>> dispatching > > >>>>>>>>>> to > > >>>>>>>>>>> the > > >>>>>>>>>>>>> right expression (slow), or populate two distinct vectors > > >> then > > >>>>>>>>>> merging > > >>>>>>>>>>> them > > >>>>>>>>>>>>> at the end (potentially even slower). How much faster our > > >>>>> approach > > >>>>>>> is > > >>>>>>>>>>>>> highly depends on the buffer sizes and expressions, but we > > >>>> found > > >>>>>> it > > >>>>>>>>>> to > > >>>>>>>>>>> be > > >>>>>>>>>>>>> faster enough on average to justify us extending the > > >>>> underlying > > >>>>>>>>>> layout. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> With that said, this is all within a single thread of > > >>>> execution. > > >>>>>>>>>>>>> Parallelization is done by feeding each thread its own > > >>>>>> vector/data. > > >>>>>>>>>> As > > >>>>>>>>>>>>> pointed out in a previous message, this also gives you the > > >>>>>>>>>> flexibility > > >>>>>>>>>>> to > > >>>>>>>>>>>>> implement cardinality increasing/reducing operations, but > > >> we > > >>>>> don’t > > >>>>>>>>>> use > > >>>>>>>>>>> it > > >>>>>>>>>>>>> for that purpose. Operations like filtering, joining, > > >>>> unnesting > > >>>>>> and > > >>>>>>>>>>> similar > > >>>>>>>>>>>>> are done by wrapping the internal vector in a dictionary, > > >> as > > >>>>> these > > >>>>>>>>>> need > > >>>>>>>>>>> to > > >>>>>>>>>>>>> work not only on “ListViews” but on any data types with > > >> any > > >>>>>>> encoding. > > >>>>>>>>>>> There > > >>>>>>>>>>>>> are more details on Section 4.2.1 in [1] > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Beyond this, it also gives function/kernel developers more > > >>>>>>>>>> flexibility > > >>>>>>>>>>> to > > >>>>>>>>>>>>> implement operations that manipulate Arrays/Maps. For > > >> example, > > >>>>>>>>>>> operations > > >>>>>>>>>>>>> that slice these containers can be implemented in a > > >> zero-copy > > >>>>>> manner > > >>>>>>>>>> by > > >>>>>>>>>>>>> just rearranging the lengths/offsets indices, without ever > > >>>>>> touching > > >>>>>>>>>> the > > >>>>>>>>>>>>> larger internal buffers. This is a similar motivation as > > >> for > > >>>>>>>>>> StringView > > >>>>>>>>>>>>> (think of substr(), trim(), and similar). One nice last > > >>>> property > > >>>>>> is > > >>>>>>>>>> that > > >>>>>>>>>>>>> this layout allows for overlapping ranges. This is > > >> something > > >>>>>>>>>> discussed > > >>>>>>>>>>> with > > >>>>>>>>>>>>> our ML people to allow deduping feature values in a tensor > > >>>>> (which > > >>>>>> is > > >>>>>>>>>>> fairly > > >>>>>>>>>>>>> common), but not something we have leveraged just yet. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> [1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Best, > > >>>>>>>>>>>>> -- > > >>>>>>>>>>>>> Pedro Pedreira > > >>>>>>>>>>>>> ________________________________ > > >>>>>>>>>>>>> From: Felipe Oliveira Carvalho <felipe...@gmail.com> > > >>>>>>>>>>>>> Sent: Friday, May 19, 2023 10:01 AM > > >>>>>>>>>>>>> To: dev@arrow.apache.org <dev@arrow.apache.org> > > >>>>>>>>>>>>> Cc: Pedro Eugenio Rocha Pedreira <pedro...@meta.com> > > >>>>>>>>>>>>> Subject: Re: [DISCUSS][Format] Starting the draft > > >>>> implementation > > >>>>>> of > > >>>>>>>>>> the > > >>>>>>>>>>>>> ArrayView array format > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> +pedroerp On Thu, 11 May 2023 at 17: 51 Raphael > > >> Taylor-Davies > > >>>>> <r. > > >>>>>>>>>>>>> taylordavies@ googlemail. com. invalid> wrote: Hi All, > > > >> if > > >>>> we > > >>>>>>> added > > >>>>>>>>>>>>> this, do we think many Arrow and query > engine > > >>>> implementations > > >>>>>> (for > > >>>>>>>>>>>>> example, DataFusion) will be > > >>>>>>>>>>>>> ZjQcmQRYFpfptBannerStart > > >>>>>>>>>>>>> This Message Is From an External Sender > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> ZjQcmQRYFpfptBannerEnd > > >>>>>>>>>>>>> +pedroerp > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies > > >>>>>>>>>>>>> <r.taylordav...@googlemail.com.invalid> wrote: > > >>>>>>>>>>>>> Hi All, > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> if we added this, do we think many Arrow and query > > >>>>>>>>>>>>>> engine implementations (for example, DataFusion) will be > > >>>> eager > > >>>>> to > > >>>>>>>>>> add > > >>>>>>>>>>>>> full > > >>>>>>>>>>>>>> support for the type, including compute kernels? Or are > > >> they > > >>>>>> likely > > >>>>>>>>>> to > > >>>>>>>>>>>>> just > > >>>>>>>>>>>>>> convert this type to ListArray at import boundaries? > > >>>>>>>>>>>>> I can't speak for query engines in general, but at least > > >> for > > >>>>>>> arrow-rs > > >>>>>>>>>>>>> and by extension DataFusion, and based on my current > > >>>>> understanding > > >>>>>>> of > > >>>>>>>>>>>>> the use-cases I would be rather hesitant to add support > > >> to the > > >>>>>>>>>> kernels > > >>>>>>>>>>>>> for this array type, definitely instead favouring > > >> conversion > > >>>> at > > >>>>>> the > > >>>>>>>>>>>>> edges. We already have issues with the amount of code > > >>>> generation > > >>>>>>>>>>>>> resulting in binary bloat and long compile times, and I > > >> worry > > >>>>> this > > >>>>>>>>>> would > > >>>>>>>>>>>>> worsen this situation whilst not really providing > > >> compelling > > >>>>>>>>>> advantages > > >>>>>>>>>>>>> for the vast majority of workloads that don't interact > > >> with > > >>>>> Velox. > > >>>>>>>>>>>>> Whilst I can definitely see that the ListView > > >> representation > > >>>> is > > >>>>>>>>>> probably > > >>>>>>>>>>>>> a better way to represent variable length lists than what > > >>>> arrow > > >>>>>>>>>> settled > > >>>>>>>>>>>>> upon, I'm not yet convinced it is sufficiently better to > > >>>>>> incentivise > > >>>>>>>>>>>>> broad ecosystem adoption. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Kind Regards, > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Raphael Taylor-Davies > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> On 11/05/2023 21:20, Will Jones wrote: > > >>>>>>>>>>>>>> Hi Felipe, > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Thanks for the additional details. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Velox kernels benefit from being able to append data to > > >> the > > >>>>>> array > > >>>>>>>>>> from > > >>>>>>>>>>>>>>> different threads without care for strict ordering. > > >> Only the > > >>>>>>>>>> offsets > > >>>>>>>>>>>>> array > > >>>>>>>>>>>>>>> has to be written according to logical order but that is > > >>>>>>>>>> potentially a > > >>>>>>>>>>>>> much > > >>>>>>>>>>>>>>> smaller buffer than the values buffer. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>> It still seems to me like applications are still pretty > > >>>> niche, > > >>>>>> as I > > >>>>>>>>>>>>> suspect > > >>>>>>>>>>>>>> in most cases the benefits are outweighed by the costs. > > >> The > > >>>>>> benefit > > >>>>>>>>>>> here > > >>>>>>>>>>>>>> seems pretty limited: if you are trying to split work > > >> between > > >>>>>>>>>> threads, > > >>>>>>>>>>>>>> usually you will have other levels such as array chunks > > >> to > > >>>>>>>>>> parallelize. > > >>>>>>>>>>>>> And > > >>>>>>>>>>>>>> if you have an incoming stream of row data, you'll want > > >> to > > >>>>> append > > >>>>>>> in > > >>>>>>>>>>>>>> predictable order to match the order of the other > > >> arrays. Am > > >>>> I > > >>>>>>>>>> missing > > >>>>>>>>>>>>>> something? > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> And, IIUC, the cost of using ListView with out-of-order > > >>>> values > > >>>>>> over > > >>>>>>>>>>>>>> ListArray is you lose memory locality; the values of > > >> element > > >>>> 2 > > >>>>>> are > > >>>>>>>>>> no > > >>>>>>>>>>>>>> longer adjacent to the values of element 1. What do you > > >> think > > >>>>>> about > > >>>>>>>>>>> that > > >>>>>>>>>>>>>> tradeoff? > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> I don't mean to be difficult about this. I'm excited for > > >> both > > >>>>> the > > >>>>>>>>>> REE > > >>>>>>>>>>> and > > >>>>>>>>>>>>>> StringView arrays, but this one I'm not so sure about > > >> yet. I > > >>>>>>> suppose > > >>>>>>>>>>>>> what I > > >>>>>>>>>>>>>> am trying to ask is, if we added this, do we think many > > >> Arrow > > >>>>> and > > >>>>>>>>>> query > > >>>>>>>>>>>>>> engine implementations (for example, DataFusion) will be > > >>>> eager > > >>>>> to > > >>>>>>>>>> add > > >>>>>>>>>>>>> full > > >>>>>>>>>>>>>> support for the type, including compute kernels? Or are > > >> they > > >>>>>> likely > > >>>>>>>>>> to > > >>>>>>>>>>>>> just > > >>>>>>>>>>>>>> convert this type to ListArray at import boundaries? > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Because if it turns out to be the latter, then we might > > >> as > > >>>> well > > >>>>>> ask > > >>>>>>>>>>> Velox > > >>>>>>>>>>>>>> to export this type as ListArray and save the rest of the > > >>>>>> ecosystem > > >>>>>>>>>>> some > > >>>>>>>>>>>>>> work. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Best, > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Will Jones > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira > > >> Carvalho < > > >>>>>>>>>>>>>> felipe...@gmail.com<mailto:felipe...@gmail.com>> wrote: > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Initial reason for ListView arrays in Arrow is zero-copy > > >>>>>>>>>> compatibility > > >>>>>>>>>>>>> with > > >>>>>>>>>>>>>>> Velox which uses this format. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Velox kernels benefit from being able to append data to > > >> the > > >>>>>> array > > >>>>>>>>>> from > > >>>>>>>>>>>>>>> different threads without care for strict ordering. > > >> Only the > > >>>>>>>>>> offsets > > >>>>>>>>>>>>> array > > >>>>>>>>>>>>>>> has to be written according to logical order but that is > > >>>>>>>>>> potentially a > > >>>>>>>>>>>>> much > > >>>>>>>>>>>>>>> smaller buffer than the values buffer. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Acero kernels could take advantage of that in the > > >> future. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> In implementing ListViewArray/Type I was able to reuse > > >> some > > >>>>> C++ > > >>>>>>>>>>>>> templates > > >>>>>>>>>>>>>>> used for ListArray which can reduce some of the burden > > >> on > > >>>>> kernel > > >>>>>>>>>>>>>>> implementations that aim to work with all the types. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> I’m can fix Acero kernels for working with ListView. > > >> This is > > >>>>>>>>>> similar > > >>>>>>>>>>> to > > >>>>>>>>>>>>> the > > >>>>>>>>>>>>>>> work I’ve doing in kernels dealing with run-end encoded > > >>>>> arrays. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> — > > >>>>>>>>>>>>>>> Felipe > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> On Wed, 26 Apr 2023 at 01:03 Will Jones < > > >>>>>> will.jones...@gmail.com > > >>>>>>>>>>>>> <mailto:will.jones...@gmail.com>> wrote: > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> I suppose one common use case is materializing list > > >> columns > > >>>>>> after > > >>>>>>>>>>> some > > >>>>>>>>>>>>>>>> expanding operation like a join or unnest. That's a > > >> case > > >>>>> where > > >>>>>> I > > >>>>>>>>>>> could > > >>>>>>>>>>>>>>>> imagine a lot of repetition of values. Haven't yet > > >> thought > > >>>> of > > >>>>>>>>>> common > > >>>>>>>>>>>>>>> cases > > >>>>>>>>>>>>>>>> where there is overlap but not full duplication, but am > > >>>> eager > > >>>>>> to > > >>>>>>>>>> hear > > >>>>>>>>>>>>>>> any. > > >>>>>>>>>>>>>>>> The dictionary encoding point Raphael makes is > > >> interesting, > > >>>>>>>>>>> especially > > >>>>>>>>>>>>>>>> given the existence of LargeList and FixedSizeList. For > > >>>> many > > >>>>>>>>>>>>> operations, > > >>>>>>>>>>>>>>> it > > >>>>>>>>>>>>>>>> might make more sense to just compose those existing > > >> types. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> IIUC the operations that would be unique to the > > >> ArrayView > > >>>> are > > >>>>>>> ones > > >>>>>>>>>>>>>>> altering > > >>>>>>>>>>>>>>>> the shape. One could truncate each array to a certain > > >>>> length > > >>>>>>>>>> cheaply > > >>>>>>>>>>>>>>> simply > > >>>>>>>>>>>>>>>> by replacing the sizes buffer. Or perhaps there are > > >>>>> interesting > > >>>>>>>>>>>>>>> operations > > >>>>>>>>>>>>>>>> on tensors that would benefit. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 7:47 PM Raphael Taylor-Davies > > >>>>>>>>>>>>>>>> <r.taylordav...@googlemail.com.invalid> wrote: > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Unless I am missing something, I think the selection > > >>>>> use-case > > >>>>>>>>>> could > > >>>>>>>>>>> be > > >>>>>>>>>>>>>>>>> equally well served by a dictionary-encoded > > >>>>>>> BinarArray/ListArray, > > >>>>>>>>>>> and > > >>>>>>>>>>>>>>>> would > > >>>>>>>>>>>>>>>>> have the benefit of not requiring any modifications > > >> to the > > >>>>>>>>>> existing > > >>>>>>>>>>>>>>>> format > > >>>>>>>>>>>>>>>>> or kernels. > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> The major additional flexibility of the proposed > > >> encoding > > >>>>>> would > > >>>>>>>>>> be > > >>>>>>>>>>>>>>>>> permitting disjoint or overlapping ranges, are these > > >>>> common > > >>>>>>>>>> enough > > >>>>>>>>>>> in > > >>>>>>>>>>>>>>>>> practice to represent a meaningful bottleneck? > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> On 26 April 2023 01:40:14 BST, David Li < > > >>>>> lidav...@apache.org > > >>>>>>>>>>> <mailto: > > >>>>>>>>>>>>> lidav...@apache.org>> wrote: > > >>>>>>>>>>>>>>>>>> Is there a need for a 64-bit offsets version the > > >> same way > > >>>>> we > > >>>>>>>>>> have > > >>>>>>>>>>>>> List > > >>>>>>>>>>>>>>>>> and LargeList? > > >>>>>>>>>>>>>>>>>> And just to be clear, the difference with List is > > >> that > > >>>> the > > >>>>>>> lists > > >>>>>>>>>>>>> don't > > >>>>>>>>>>>>>>>>> have to be stored in their logical order (or in other > > >>>> words, > > >>>>>>>>>> offsets > > >>>>>>>>>>>>> do > > >>>>>>>>>>>>>>>> not > > >>>>>>>>>>>>>>>>> have to be nondecreasing and so we also need sizes)? > > >>>>>>>>>>>>>>>>>> On Wed, Apr 26, 2023, at 09:37, Weston Pace wrote: > > >>>>>>>>>>>>>>>>>>> For context, there was some discussion on this back > > >> in > > >>>>> [1]. > > >>>>>> At > > >>>>>>>>>>> that > > >>>>>>>>>>>>>>>>> time > > >>>>>>>>>>>>>>>>>>> this was called "sequence view" but I do not like > > >> that > > >>>>> name. > > >>>>>>>>>>>>>>> However, > > >>>>>>>>>>>>>>>>>>> array-view array is a little confusing. Given this > > >> is > > >>>>>> similar > > >>>>>>>>>> to > > >>>>>>>>>>>>>>> list > > >>>>>>>>>>>>>>>>> can > > >>>>>>>>>>>>>>>>>>> we go with list-view array? > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Thanks for the introduction. I'd be interested to > > >> hear > > >>>>>> about > > >>>>>>>>>> the > > >>>>>>>>>>>>>>>>>>>> applications Velox has found for these vectors, > > >> and in > > >>>>> what > > >>>>>>>>>>>>>>>> situations > > >>>>>>>>>>>>>>>>>>> they > > >>>>>>>>>>>>>>>>>>>> are useful. This could be contrasted with the > > >> current > > >>>>>>>>>> ListArray > > >>>>>>>>>>>>>>>>>>>> implementations. > > >>>>>>>>>>>>>>>>>>> I believe one significant benefit is that take (and > > >> by > > >>>>>> proxy, > > >>>>>>>>>>>>>>> filter) > > >>>>>>>>>>>>>>>>> and > > >>>>>>>>>>>>>>>>>>> sort are O(# of items) with the proposed format and > > >> O(# > > >>>> of > > >>>>>>>>>> bytes) > > >>>>>>>>>>>>>>> with > > >>>>>>>>>>>>>>>>> the > > >>>>>>>>>>>>>>>>>>> current format. Jorge did some profiling to this > > >> effect > > >>>> in > > >>>>>>> [1]. > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> [1] > > >>>>>>>>>>>>>>> > > >>>>>> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq< > > >>>>>>>>>>>>> > > >>>>> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq> > > >>>>>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 3:13 PM Will Jones < > > >>>>>>>>>>> will.jones...@gmail.com > > >>>>>>>>>>>>> <mailto:will.jones...@gmail.com> > > >>>>>>>>>>>>>>>>> wrote: > > >>>>>>>>>>>>>>>>>>>> Hi Felipe, > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Thanks for the introduction. I'd be interested to > > >> hear > > >>>>>> about > > >>>>>>>>>> the > > >>>>>>>>>>>>>>>>>>>> applications Velox has found for these vectors, > > >> and in > > >>>>> what > > >>>>>>>>>>>>>>>> situations > > >>>>>>>>>>>>>>>>> they > > >>>>>>>>>>>>>>>>>>>> are useful. This could be contrasted with the > > >> current > > >>>>>>>>>> ListArray > > >>>>>>>>>>>>>>>>>>>> implementations. > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> IIUC it would be fairly cheap to transform a > > >> ListArray > > >>>> to > > >>>>>> an > > >>>>>>>>>>>>>>>>> ArrayView, but > > >>>>>>>>>>>>>>>>>>>> expensive to go the other way. > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Best, > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Will Jones > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira > > >>>> Carvalho > > >>>>> < > > >>>>>>>>>>>>>>>>>>>> felipe...@gmail.com<mailto:felipe...@gmail.com>> > > >>>> wrote: > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> Hi folks, > > >>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> I would like to start a public discussion on the > > >>>>> inclusion > > >>>>>>>>>> of a > > >>>>>>>>>>>>>>> new > > >>>>>>>>>>>>>>>>> array > > >>>>>>>>>>>>>>>>>>>>> format to Arrow — array-view array. The name is > > >> also > > >>>> up > > >>>>>> for > > >>>>>>>>>>>>>>> debate. > > >>>>>>>>>>>>>>>>>>>>> This format is inspired by Velox's ArrayVector > > >> format > > >>>>> [1]. > > >>>>>>>>>>>>>>>> Logically, > > >>>>>>>>>>>>>>>>>>>> this > > >>>>>>>>>>>>>>>>>>>>> array represents an array of arrays. Each element > > >> is > > >>>> an > > >>>>>>>>>>>>>>> array-view > > >>>>>>>>>>>>>>>>>>>> (offset > > >>>>>>>>>>>>>>>>>>>>> and size pair) that points to a range within a > > >> nested > > >>>>>>>>>> "values" > > >>>>>>>>>>>>>>>> array > > >>>>>>>>>>>>>>>>>>>>> (called "elements" in Velox docs). The nested > > >> array > > >>>> can > > >>>>> be > > >>>>>>> of > > >>>>>>>>>>> any > > >>>>>>>>>>>>>>>>> type, > > >>>>>>>>>>>>>>>>>>>>> which makes this format very flexible and > > >> powerful. > > >>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> [image: ../_images/array-vector.png] > > >>>>>>>>>>>>>>>>>>>>> < > > >>>>>>>>>>>>>>>> > > >>>>>>>>>> > > >>>> https://facebookincubator.github.io/velox/_images/array-vector.png > > >>>>> < > > >>>>>>>>>>>>> > > >>>>>> > https://facebookincubator.github.io/velox/_images/array-vector.png > > >>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> I'm currently working on a C++ implementation and > > >> plan > > >>>>> to > > >>>>>>>>>> work > > >>>>>>>>>>>>>>> on a > > >>>>>>>>>>>>>>>>> Go > > >>>>>>>>>>>>>>>>>>>>> implementation to fulfill the two-implementations > > >>>>>>> requirement > > >>>>>>>>>>> for > > >>>>>>>>>>>>>>>>> format > > >>>>>>>>>>>>>>>>>>>>> changes. > > >>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> The draft design: > > >>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> - 3 buffers: [validity_bitmap, int32 offsets > > >> buffer, > > >>>>> int32 > > >>>>>>>>>> sizes > > >>>>>>>>>>>>>>>>> buffer] > > >>>>>>>>>>>>>>>>>>>>> - 1 child array: "values" as an array of the type > > >>>>>> parameter > > >>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> validity_bitmap is used to differentiate between > > >> empty > > >>>>>> array > > >>>>>>>>>>>>>>> views > > >>>>>>>>>>>>>>>>>>>>> (sizes[i] == 0) and NULL array views > > >>>> (validity_bitmap[i] > > >>>>>> == > > >>>>>>>>>> 0). > > >>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> When the validity_bitmap[i] is 0, both sizes and > > >>>> offsets > > >>>>>> are > > >>>>>>>>>>>>>>>>> undefined > > >>>>>>>>>>>>>>>>>>>> (as > > >>>>>>>>>>>>>>>>>>>>> usual), and when sizes[i] == 0, offsets[i] is > > >>>>> undefined. 0 > > >>>>>>> is > > >>>>>>>>>>>>>>>>> recommended > > >>>>>>>>>>>>>>>>>>>>> if setting a value is not an issue to the system > > >>>>> producing > > >>>>>>>>>> the > > >>>>>>>>>>>>>>>>> arrays. > > >>>>>>>>>>>>>>>>>>>>> offsets buffer is not required to be ordered and > > >> views > > >>>>>> don't > > >>>>>>>>>>> have > > >>>>>>>>>>>>>>>> to > > >>>>>>>>>>>>>>>>> be > > >>>>>>>>>>>>>>>>>>>>> disjoint. > > >>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> [1] > > >>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >> > > > https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector > > >>>>>>>>>>>>> < > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >> > > > https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> Thanks, > > >>>>>>>>>>>>>>>>>>>>> Felipe O. Carvalho > > >>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >> > > > > > > > >