Arrow has at least 7 native "official" implementations (Java, Rust, Golang, C#, Javascript, Julia and C++), 5 bindings on C++ (C, Ruby, Python, R, and Matlab) and likely other implementations (like arrow2 in rust)
I think it is worth remembering that depending on what level of support ListView aspires to, such an addition could require non trivial changes to many / all of those implementations (and the APIs they expose). Andrew On Wed, Jun 14, 2023 at 12:53 PM Felipe Oliveira Carvalho < felipe...@gmail.com> wrote: > General approach to alternative formats aside, in the specific case of > ListView, I think the implementation complexity is being overestimated in > these discussions. > > The C++ Arrow implementation shares a lot of code between List and > LargeList. And with some tweaks, I'm able to share that common > infrastructure for ListView as well. [1] > > ListView is similar to list: it doesn't require offsets to be sorted and > adds an extra buffer containing sizes. For symmetry with the List and > LargeList types (FixedSizeList not included), I'm going to propose we add a > LargeListView. That is not part of the draft implementation yet, but seems > like an obvious thing to have now that I implemented the `if_else` > specialization. [2] David Li asked about this above and I can confirm now > that 64-bit version of ListView (LargeListView) is in the plans. > > Trying to avoid re-implementing some kernels is not a good goal to chase, > IMO, because kernels need tweaks to take advantage of the format. > > [1] https://github.com/apache/arrow/pull/35345 > [2] https://github.com/felipecrv/arrow/commits/list_view_backup > -- > Felipe > > On Wed, Jun 14, 2023 at 12:08 PM Weston Pace <weston.p...@gmail.com> > wrote: > > > > perhaps we could support this use-case as > > > a canonical extension type over dictionary encoded, variable-sized > > > arrays > > > > I believe this suggestion is valid and could be used to solve the if-else > > case. The algorithm, if I understand it, would be roughly: > > > > ``` > > // Note: Simple pseudocode, vectorization left as exercise for the reader > > auto indices_builder = ... > > auto list_builder = ... > > indices_builder.resize(batch.length); > > Array condition_mask = condition.EvaluateBatch(batch); > > for row_index in selected_rows(condition_mask): > > indices_builder[row_index] = list_builder.CurrentLength(); > > list_builder.Append(if_expr.EvaluateRow(batch, row_index)) > > for row_index in unselected_rows(condition_mask): > > indices_builder[row_index] = list_builder.CurrentLength(); > > list_builder.Append(else_expr.EvaluateRow(batch, row_index)) > > return DictionaryArray(indices_builder.Finish(), list_builder.Finish()) > > ``` > > > > I also agree this is a slightly awkward use of dictionaries (e.g. the > > dictionary would have the same size as the # of indices) and perhaps not > > the most intuitive way to solve the problem. > > > > My gut reaction is simply "an improved if/else kernel is not, alone, > enough > > justification for a new layout" and yet... > > > > I think we are seeing the start (I hope) of a trend where Arrow is not > just > > used "between systems" (e.g. to shuttle data from one place to another, > or > > between a query engine and a visualization tool) but also "within > systems" > > (e.g. UDFs, bespoke file formats and temporary tables, between workers > in a > > distributed query engine). When arrow is used "within systems" I think > > both the number of bespoke formats and the significance of conversion > cost > > increases. For example, it's easy to say that Velox should convert at > the > > boundary as data leaves Velox. But what if Velox (or datafusion or ...) > > were to define an interface for UDFs. Would we want to use Arrow there > > (e.g. the C data interface is a good fit)? If so, wouldn't the > conversion > > cost be more significant? > > > > > Also, I'm very lukewarm towards the concept of "alternative layouts" > > > suggested somewhere else in this thread. It does not seem a good choice > > > to complexify the Arrow format that much. > > > > I think, in my opinion, this depends on how many of these alternative > > layouts exist. If there are just a few, then I agree, we should just > adopt > > them as formal first-class layouts. If there are many, then I think it > > will be too much complexity in Arrow to have all the different choices. > > Or, we could say there are many, but the alternatives don't belong in > Arrow > > at all. In that case I think it's the same question as the above > > paragraph, "do we want Arrow to be used within systems? Or just between > > systems?" > > > > > > On Wed, Jun 14, 2023 at 2:07 AM Antoine Pitrou <anto...@python.org> > wrote: > > > > > > > > I agree that ListView cannot be an extension type, given that it > > > features a new layout, and therefore cannot reasonably be backed by an > > > existing storage type (AFAICT). > > > > > > Also, I'm very lukewarm towards the concept of "alternative layouts" > > > suggested somewhere else in this thread. It does not seem a good choice > > > to complexify the Arrow format that much. > > > > > > Regards > > > > > > Antoine. > > > > > > > > > Le 07/06/2023 à 00:21, Felipe Oliveira Carvalho a écrit : > > > > +1 on what Ian said. > > > > > > > > And as I write kernels for this new format, I’m learning that it’s > > > possible > > > > to re-use the common infrastructure used by List and LargeList to > > > implement > > > > the ListView related features with some adjustments. > > > > > > > > IMO having this format as a second-class citizen would more likely > > > > complicate things because it would make this unification harder. > > > > > > > > — > > > > Felipe > > > > > > > > On Tue, 6 Jun 2023 at 18:45 Ian Cook <ianmc...@apache.org> wrote: > > > > > > > >> To clarify why we cannot simply propose adding ListView as a new > > > >> “canonical extension type”: The extension type mechanism in Arrow > > > >> depends on the underlying data being organized in an existing Arrow > > > >> layout—that way an implementation that does not support the > extension > > > >> type can still handle the underlying data. But ListView is a wholly > > > >> new layout. > > > >> > > > >> I strongly agree with Weston’s idea that it is a good time for Arrow > > > >> to introduce the notion of “canonical alternative layouts.” > > > >> > > > >> Taken together, I think that canonical extension types and canonical > > > >> alternative layouts could serve as an “incubator” for proposed new > > > >> representations. For example, if a proposed canonical alternative > > > >> layout ends up being broadly adopted, then that will serve as a > signal > > > >> that we should consider adding it as a primary layout in the core > > > >> spec. > > > >> > > > >> It seems to me that most projects that are implementing Arrow today > > > >> are not aiming to provide complete coverage of Arrow; rather they > are > > > >> adopting Arrow because of its role as a standard and they are > > > >> implementing only as much of the Arrow standard as they require to > > > >> achieve some goal. I believe that such projects are important Arrow > > > >> stakeholders, and I believe that this proposed notion of canonical > > > >> alternative layouts will serve them well and will create > efficiencies > > > >> by standardizing implementations around a shared set of > alternatives. > > > >> > > > >> However I think that the documentation for canonical alternative > > > >> layouts should strongly encourage implementers to default to using > the > > > >> primary layouts defined in the core spec and only use alternative > > > >> layouts in cases where the primary layouts do not meet their needs. > > > >> > > > >> > > > >> On Sat, May 27, 2023 at 7:44 PM Micah Kornfield < > > emkornfi...@gmail.com> > > > >> wrote: > > > >>> > > > >>> This sounds reasonable to me but my main concern is, I'm not sure > > there > > > >> is > > > >>> a great mechanism to enforce canonical layouts don't somehow become > > > >> default > > > >>> (or the only implementation). > > > >>> > > > >>> Even for these new layouts, I think it might be worth rethinking > > > binding > > > >> a > > > >>> layout into the schema versus having a different concept of > encoding > > > (and > > > >>> changing some of the corresponding data structures). > > > >>> > > > >>> > > > >>> On Mon, May 22, 2023 at 10:37 AM Weston Pace < > weston.p...@gmail.com> > > > >> wrote: > > > >>> > > > >>>> Trying to settle on one option is a fruitless endeavor. Each type > > has > > > >> pros > > > >>>> and cons. I would also predict that the largest existing usage of > > > >> Arrow is > > > >>>> shuttling data from one system to another. The newly proposed > > format > > > >>>> doesn't appear to have any significant advantage for that use case > > (if > > > >>>> anything, the existing format is arguably better as it is more > > > >> compact). > > > >>>> > > > >>>> I am very biased towards historical precedent and avoiding > breaking > > > >>>> changes. > > > >>>> > > > >>>> We have "canonical extension types", perhaps it is time for > > "canonical > > > >>>> alternative layouts". We could define it as such: > > > >>>> > > > >>>> * There are one or more primary layouts > > > >>>> * Existing layouts are automatically considered primary > layouts, > > > >> even if > > > >>>> they wouldn't > > > >>>> have been primary layouts initially (e.g. large list) > > > >>>> * A new layout, if it is semantically equivalent to another, is > > > >> considered > > > >>>> an alternative layout > > > >>>> * An alternative layout still has the same requirements for > > adoption > > > >> (two > > > >>>> implementations and a vote) > > > >>>> * An implementation should not feel pressured to rush and > > > implement > > > >> the > > > >>>> new layout. > > > >>>> It would be good if they contribute in the discussion and > > > >> consider the > > > >>>> layout and vote > > > >>>> if they feel it would be an acceptable design. > > > >>>> * We can define and vote and approve as many canonical > alternative > > > >> layouts > > > >>>> as we want: > > > >>>> * A canonical alternative layout should, at a minimum, have > some > > > >>>> reasonable justification, such as improved performance for > > > >> algorithm X > > > >>>> * Arrow implementations MUST support the primary layouts > > > >>>> * An Arrow implementation MAY support a canonical alternative, > > > >> however: > > > >>>> * An Arrow implementation MUST first support the primary > layout > > > >>>> * An Arrow implementation MUST support conversion to/from the > > > >> primary > > > >>>> and canonical layout > > > >>>> * An Arrow implementation's APIs MUST only provide data in the > > > >>>> alternative > > > >>>> layout if it is explicitly asked for (e.g. schema inference > > > should > > > >>>> prefer the primary layout). > > > >>>> * We can still vote for new primary layouts (e.g. promoting a > > > >> canonical > > > >>>> alternative) but, in these > > > >>>> votes we don't only consider the value (e.g. performance) of > > the > > > >> layout > > > >>>> but also the interoperability. > > > >>>> In other words, a layout can only become a primary layout if > > > there > > > >> is > > > >>>> significant evidence that most > > > >>>> implementations plan to adopt it. > > > >>>> > > > >>>> This lets us evolve support for new layouts more naturally. We > can > > > >>>> generally assume that users will not, initially, be aware of these > > > >>>> alternative layouts. However, everything will just work. They > may > > > >> start > > > >>>> to see a performance penalty stemming from a lack of support for > > these > > > >>>> layouts. If this performance penalty becomes significant then > they > > > >> will > > > >>>> discover it and become aware of the problem. They can then ask > > > >> whatever > > > >>>> library they are using to add support for the alternative layout. > > As > > > >>>> enough users find a need for it then libraries will add support. > > > >>>> Eventually, enough libraries will support it that we can adopt it > > as a > > > >>>> primary layout. > > > >>>> > > > >>>> Also, it allows libraries to adopt alternative layouts more > > > >> aggressively if > > > >>>> they would like while still hopefully ensuring that we eventually > > all > > > >>>> converge on the same implementation of the alternative layout. > > > >>>> > > > >>>> On Mon, May 22, 2023 at 9:35 AM Will Jones < > will.jones...@gmail.com > > > > > > >>>> wrote: > > > >>>> > > > >>>>> Hello Arrow devs, > > > >>>>> > > > >>>>> I don't understand why we would start deprecating features in the > > > >> Arrow > > > >>>>>> format. Even starting this talk might already be a bad idea > > > >> PR-wise. > > > >>>>>> > > > >>>>> > > > >>>>> I agree we don't want to make breaking changes to the Arrow > format. > > > >> But > > > >>>>> several maintainers have already stated they have no interest in > > > >>>>> maintaining both list types with full compute functionality > [1][2], > > > >> so I > > > >>>>> think it's very likely one list type or the other will be > > > >>>>> implicitly preferred in the ecosystem if this data type was > added. > > > >> If > > > >>>>> that's the case, I'd prefer that we agreed as a community which > one > > > >>>> should > > > >>>>> be preferred. Maybe that's not the best path; it's just one way > for > > > >> us to > > > >>>>> balance stability, maintenance burden, and relevance. > > > >>>>> > > > >>>>> Can someone help distill down the primary rationale and usecase > for > > > >>>>>> adding ArrayView to the Arrow Spec? > > > >>>>>> > > > >>>>> > > > >>>>> Looking back at that old thread, I think one of the main > > motivations > > > >> is > > > >>>> to > > > >>>>> try to prevent query engine implementers from feeling there is a > > > >> tradeoff > > > >>>>> between having state-of-the-art performance and being > Arrow-native. > > > >> For > > > >>>>> some of the new array types, we had both Velox and DuckDB use > them, > > > >> so it > > > >>>>> was reasonable to expect they were innovations that might > > > >> proliferate. > > > >>>> I'm > > > >>>>> not sure if the ArrayView is part of that. From Wes earlier [3]: > > > >>>>> > > > >>>>> The idea is that in a world of data and query federation (for > > > >> example, > > > >>>>>> consider [1] where Arrow is being used as a data federation > layer > > > >> with > > > >>>>> many > > > >>>>>> query engines), we want to increase the amount of data in-flight > > > >> and > > > >>>>>> in-memory that is in Arrow format. So if query engines are > having > > > >> to > > > >>>>> depart > > > >>>>>> substantially from the Arrow format to get performance, then > this > > > >>>>> creates a > > > >>>>>> potential lose-lose situation: * Depart from Arrow: get better > > > >>>>> performance > > > >>>>>> but pay serialization costs to read and write Arrow (the > > > >> performance > > > >>>> and > > > >>>>>> resource utilization benefits outweigh the serialization costs). > > > >> This > > > >>>>> puts > > > >>>>>> additional pressure on query engines to build specialized > > > >> components > > > >>>> for > > > >>>>>> solving problems rather than making use of off-the-shelf > > components > > > >>>> that > > > >>>>>> use Arrow. This has knock-on effects on ecosystem > fragmentation. * > > > >> Or > > > >>>> use > > > >>>>>> Arrow, and accept suboptimal query processing performance > > > >>>>>> > > > >>>>> > > > >>>>> > > > >>>>> Will mentions one usecase is Velox consuming python UDF output, > > which > > > >>>> seems > > > >>>>>> to be mostly about how fast Velox can consume this format, not > how > > > >> fast > > > >>>>> it > > > >>>>>> can be written. Are there other usecases? > > > >>>>>> > > > >>>>> > > > >>>>> To be clear, I don't know if that's the use case they want. > That's > > > >> just > > > >>>> me > > > >>>>> speculating. > > > >>>>> > > > >>>>> I still have some questions myself: > > > >>>>> > > > >>>>> 1. Is this array type currently only used in Velox? (not DuckDB > > like > > > >> some > > > >>>>> of the other new types?) What evidence do we have that it will > > become > > > >>>> used > > > >>>>> outside of Velox? > > > >>>>> 2. We already have three list types: list, large list (64-bit > > > >> offsets), > > > >>>> and > > > >>>>> fixed size list. Do we think we will only want a view version of > > the > > > >>>> 32-bit > > > >>>>> offset variable length list? Or are we potentially talking about > > view > > > >>>>> variants for all three? > > > >>>>> > > > >>>>> Best, > > > >>>>> > > > >>>>> Will Jones > > > >>>>> > > > >>>>> [1] > > https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx > > > >>>>> [2] > > https://lists.apache.org/thread/cc4w3vs3foj1fmpq9x888k51so60ftr3 > > > >>>>> [3] > > https://lists.apache.org/thread/mk2yn62y6l8qtngcs1vg2qtwlxzbrt8t > > > >>>>> > > > >>>>> On Mon, May 22, 2023 at 3:48 AM Andrew Lamb < > al...@influxdata.com> > > > >>>> wrote: > > > >>>>> > > > >>>>>> Can someone help distill down the primary rationale and usecase > > for > > > >>>>>> adding ArrayView to the Arrow Spec? > > > >>>>>> > > > >>>>>> From the above discussions, the stated rationale seems to be > fast > > > >>>>>> (zero-copy) interchange with Velox. > > > >>>>>> > > > >>>>>> This thread has qualitatively enumerated the benefits of > > > >> (offset+len) > > > >>>>>> encoding over the existing Arrow ListArray (offets) approach, > but > > I > > > >>>>> haven't > > > >>>>>> seen any performance measurements that might help us to gauge > the > > > >>>>> tradeoff > > > >>>>>> in additional complexity vs runtime overhead. > > > >>>>>> > > > >>>>>> Will mentions one usecase is Velox consuming python UDF output, > > > >> which > > > >>>>> seems > > > >>>>>> to be mostly about how fast Velox can consume this format, not > how > > > >> fast > > > >>>>> it > > > >>>>>> can be written. Are there other usecases? > > > >>>>>> > > > >>>>>> Do we have numbers showing how much overhead converting to /from > > > >>>> Velox's > > > >>>>>> internal representation and the existing ListArray adds? Has > > > >> anyone in > > > >>>>>> Velox land considered adding faster support for Arrow style > > > >> ListArray > > > >>>>>> encoding? > > > >>>>>> > > > >>>>>> > > > >>>>>> Andrew > > > >>>>>> > > > >>>>>> On Mon, May 22, 2023 at 4:38 AM Antoine Pitrou < > > anto...@python.org > > > >>> > > > >>>>> wrote: > > > >>>>>> > > > >>>>>>> > > > >>>>>>> Hi, > > > >>>>>>> > > > >>>>>>> I don't understand why we would start deprecating features in > the > > > >>>> Arrow > > > >>>>>>> format. Even starting this talk might already be a bad idea > > > >> PR-wise. > > > >>>>>>> > > > >>>>>>> As for implementing conversions at the I/O boundary, it's a > > > >>>> reasonably > > > >>>>>>> policy, but it still requires work by implementors and it's not > > > >>>> granted > > > >>>>>>> that all consumers of the Arrow format will grow such > conversions > > > >>>>>>> if/when we add non-trivial types such as ListView or > StringView. > > > >>>>>>> > > > >>>>>>> Regards > > > >>>>>>> > > > >>>>>>> Antoine. > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> Le 22/05/2023 à 00:39, Will Jones a écrit : > > > >>>>>>>> One more thing: Looking back on the previous discussion[1] > > > >> (which > > > >>>>>> Weston > > > >>>>>>>> pointed out in his earlier message), Jorge suggested that the > > > >> old > > > >>>>> list > > > >>>>>>>> types might be deprecated in favor of view variants [2]. > Others > > > >>>> were > > > >>>>>>>> worried that it might undermine the perception that the Arrow > > > >>>> format > > > >>>>> is > > > >>>>>>>> stable. I think it might be worth thinking about "soft > > > >> deprecating" > > > >>>>> the > > > >>>>>>> old > > > >>>>>>>> list type: suggesting new implementations prefer the list > > > >> view, but > > > >>>>>>>> reassuring that implementations should support the old format, > > > >> even > > > >>>>> if > > > >>>>>>> they > > > >>>>>>>> just convert to the new format. To be clear, this wouldn't > > > >> mean we > > > >>>>> plan > > > >>>>>>> to > > > >>>>>>>> create breaking changes in the format; but if we ever did for > > > >> other > > > >>>>>>>> reasons, the old list type might go. > > > >>>>>>>> > > > >>>>>>>> Arrow compute libraries could choose either format for compute > > > >>>>> support, > > > >>>>>>> and > > > >>>>>>>> plan to do conversion at the boundaries. Libraries that use > > > >> the new > > > >>>>>> type > > > >>>>>>>> will have cheap conversion when reading the old type. > Meanwhile > > > >>>> those > > > >>>>>>> that > > > >>>>>>>> are still on the old type will have some incentive to move > > > >> towards > > > >>>>> the > > > >>>>>>> new > > > >>>>>>>> one, since that conversion will not be as efficient. > > > >>>>>>>> > > > >>>>>>>> [1] > > > >>>> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq > > > >>>>>>>> [2] > > > >>>> https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx > > > >>>>>>>> > > > >>>>>>>> On Sun, May 21, 2023 at 3:07 PM Will Jones < > > > >>>> will.jones...@gmail.com> > > > >>>>>>> wrote: > > > >>>>>>>> > > > >>>>>>>>> Hello, > > > >>>>>>>>> > > > >>>>>>>>> I think Sasha brings up a good point, that the advantages of > > > >> this > > > >>>>>> format > > > >>>>>>>>> seem to be primarily about query processing. Other encodings > > > >> like > > > >>>>> REE > > > >>>>>>> and > > > >>>>>>>>> dictionary have space-saving advantages that justify them > > > >> simply > > > >>>> in > > > >>>>>>> terms > > > >>>>>>>>> of space efficiency (although they have query processing > > > >>>> advantages > > > >>>>> as > > > >>>>>>>>> well). As discussed, most use cases are already well served > by > > > >>>>>> existing > > > >>>>>>>>> list types and dictionary encoding. > > > >>>>>>>>> > > > >>>>>>>>> I agree that there are cases where transferring this type > > > >> without > > > >>>>>>>>> conversion would be ideal. One use case I can think of is if > > > >> Velox > > > >>>>>>> wants to > > > >>>>>>>>> be able to take Arrow-based UDFs (possibly written with > > > >> PyArrow, > > > >>>> for > > > >>>>>>>>> example) that operate on this column type and therefore wants > > > >>>>>> zero-copy > > > >>>>>>>>> exchange over the C Data Interface. > > > >>>>>>>>> > > > >>>>>>>>> One big question I have: we already have three list types: > > > >> list, > > > >>>>> large > > > >>>>>>>>> list (64-bit offsets), and fixed size list. Do we think we > > > >> will > > > >>>> only > > > >>>>>>> want a > > > >>>>>>>>> view version of the 32-bit offset variable length list? Or > > > >> are we > > > >>>>>>>>> potentially talking about view variants for all three? > > > >>>>>>>>> > > > >>>>>>>>> Best, > > > >>>>>>>>> > > > >>>>>>>>> Will Jones > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> On Sun, May 21, 2023 at 2:19 PM Felipe Oliveira Carvalho < > > > >>>>>>>>> felipe...@gmail.com> wrote: > > > >>>>>>>>> > > > >>>>>>>>>> The benefit of having a memory format that’s friendly to > > > >>>>>>> non-deterministic > > > >>>>>>>>>> order writes is unlocked by the transport and processing of > > > >> the > > > >>>>> data > > > >>>>>>> being > > > >>>>>>>>>> agnostic to the physical order as much as possible. > > > >>>>>>>>>> > > > >>>>>>>>>> Requiring a conversion could cancel out that benefit. But it > > > >> can > > > >>>>> be a > > > >>>>>>>>>> provisory step for compatibility between systems that don’t > > > >>>>>> understand > > > >>>>>>> the > > > >>>>>>>>>> format yet. This is similar to the situation with > compression > > > >>>>> schemes > > > >>>>>>> like > > > >>>>>>>>>> run-end encoding — the goal is processing the compressed > data > > > >>>>>> directly > > > >>>>>>>>>> without an expansion step whenever possible. > > > >>>>>>>>>> > > > >>>>>>>>>> This is why having it as part of the open Arrow format is so > > > >>>>>> important: > > > >>>>>>>>>> everyone can agree on a format that’s friendly to parallel > > > >> and/or > > > >>>>>>>>>> vectorized compute kernels without introducing multiple > > > >>>>> incompatible > > > >>>>>>>>>> formats to the ecosystem and without imposing a conversion > > > >> step > > > >>>>>> between > > > >>>>>>>>>> the > > > >>>>>>>>>> different systems. > > > >>>>>>>>>> > > > >>>>>>>>>> — > > > >>>>>>>>>> Felipe > > > >>>>>>>>>> > > > >>>>>>>>>> On Sat, 20 May 2023 at 20:04 Aldrin > > > >> <octalene....@pm.me.invalid> > > > >>>>>>> wrote: > > > >>>>>>>>>> > > > >>>>>>>>>>> I don't feel like this representation is necessarily a > > > >> detail of > > > >>>>> the > > > >>>>>>>>>> query > > > >>>>>>>>>>> engine, but I am also not sure why this representation > would > > > >>>> have > > > >>>>> to > > > >>>>>>> be > > > >>>>>>>>>>> converted to a non-view format when serializing. Could you > > > >>>> clarify > > > >>>>>>>>>> that? My > > > >>>>>>>>>>> impression is that this representation could be used for > > > >>>>> persistence > > > >>>>>>> or > > > >>>>>>>>>>> data transfer, though it can be more complex to guarantee > > > >> the > > > >>>>>> portion > > > >>>>>>> of > > > >>>>>>>>>>> the buffer that an index points to is also present in > > > >> memory. > > > >>>>>>>>>>> > > > >>>>>>>>>>> Sent from Proton Mail for iOS > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky < > > > >>>>>>>>>> krassovskysa...@gmail.com > > > >>>>>>>>>>> > > > >> <On+Sat,+May+20,+2023+at+15:00,+Sasha+Krassovsky+%3C%3Ca+href=>> > > > >>>>>>> wrote: > > > >>>>>>>>>>> > > > >>>>>>>>>>> Hi everyone, > > > >>>>>>>>>>> I understand that there are numerous benefits to this > > > >>>>> representation > > > >>>>>>>>>>> during query processing, but would it be fair to say that > > > >> this > > > >>>> is > > > >>>>> an > > > >>>>>>>>>>> implementation detail of the query engine? Query engines > > > >> don’t > > > >>>>>>>>>> necessarily > > > >>>>>>>>>>> need to conform to the Arrow format internally, only at > > > >>>>>> ingest/egress > > > >>>>>>>>>>> points, and performing a conversion from the non-view to > > > >> view > > > >>>>> format > > > >>>>>>>>>> seems > > > >>>>>>>>>>> like it would be very cheap (though I understand not > > > >> necessarily > > > >>>>> the > > > >>>>>>>>>> other > > > >>>>>>>>>>> way around, but you’d need to do that anyway if you’re > > > >>>>> serializing). > > > >>>>>>>>>>> > > > >>>>>>>>>>> Sasha Krassovsky > > > >>>>>>>>>>> > > > >>>>>>>>>>>> 20 мая 2023 г., в 13:00, Will Jones < > > > >> will.jones...@gmail.com> > > > >>>>>>>>>>> написал(а): > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Thanks for sharing these details, Pedro. The conditional > > > >>>>> branches > > > >>>>>>>>>>> argument > > > >>>>>>>>>>>> makes a lot of sense to me. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> The tensors point brings up some interesting issues. For > > > >> now, > > > >>>>> we've > > > >>>>>>>>>>> defined > > > >>>>>>>>>>>> our only tensor extension type to be built on a fixed size > > > >>>> list. > > > >>>>>> If a > > > >>>>>>>>>> use > > > >>>>>>>>>>>> case of this might be manipulating tensors with zero copy, > > > >>>>> perhaps > > > >>>>>>>>>> that > > > >>>>>>>>>>>> suggests that we want a fixed size list variant? In > > > >> addition, > > > >>>>> would > > > >>>>>>> we > > > >>>>>>>>>>> have > > > >>>>>>>>>>>> to define another extension type to be a ListView variant? > > > >> Or > > > >>>>> would > > > >>>>>>> we > > > >>>>>>>>>>> want > > > >>>>>>>>>>>> to think about making extension types somehow valid across > > > >>>>> various > > > >>>>>>>>>>>> encodings of the same "logical type"? > > > >>>>>>>>>>>> > > > >>>>>>>>>>>>> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha > > > >> Pedreira > > > >>>>>>>>>>>>> <pedro...@meta.com.invalid> wrote: > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> Hi all, > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> This is Pedro from the Velox team at Meta. This is my > > > >> first > > > >>>> time > > > >>>>>>>>>> here, > > > >>>>>>>>>>> so > > > >>>>>>>>>>>>> nice to e-meet you! > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> Adding to what Felipe said, the main reason we created > > > >>>>> “ListView” > > > >>>>>>>>>>> (though > > > >>>>>>>>>>>>> we just call them ArrayVector/MapVector in Velox) is > that, > > > >>>> along > > > >>>>>>> with > > > >>>>>>>>>>>>> StringViews for strings, they allow us to write any > > > >> columnar > > > >>>>>> buffer > > > >>>>>>>>>>>>> out-or-order, regardless of their types or encodings. > > > >> This is > > > >>>>>>>>>> naturally > > > >>>>>>>>>>>>> doable for all primitive types (fixed-size), but not for > > > >> types > > > >>>>>> that > > > >>>>>>>>>>> don’t > > > >>>>>>>>>>>>> have fixed size and are required to be contiguous. The > > > >>>>> StringView > > > >>>>>>> and > > > >>>>>>>>>>>>> ListView formats allow us to keep this invariant in > Velox. > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> Being able to write vectors out-of-order is useful when > > > >>>>> executing > > > >>>>>>>>>>>>> conditionals like IF/SWITCH statements, which are > > > >> pervasive > > > >>>>> among > > > >>>>>>> our > > > >>>>>>>>>>>>> workloads. To fully vectorize it, one first evaluates the > > > >>>>>>> expression, > > > >>>>>>>>>>> then > > > >>>>>>>>>>>>> generate a bitmap containing which rows take the THEN and > > > >>>> which > > > >>>>>> take > > > >>>>>>>>>> the > > > >>>>>>>>>>>>> ELSE branch. Then you populate all rows that match the > > > >> first > > > >>>>>> branch > > > >>>>>>>>>> by > > > >>>>>>>>>>>>> evaluating the THEN expression in a vectorized > > > >> (branch-less > > > >>>> and > > > >>>>>>> cache > > > >>>>>>>>>>>>> friendly) way, and subsequently the ELSE branch. If you > > > >> can’t > > > >>>>>> write > > > >>>>>>>>>> them > > > >>>>>>>>>>>>> out-of-order, you would either have a big branch per row > > > >>>>>> dispatching > > > >>>>>>>>>> to > > > >>>>>>>>>>> the > > > >>>>>>>>>>>>> right expression (slow), or populate two distinct vectors > > > >> then > > > >>>>>>>>>> merging > > > >>>>>>>>>>> them > > > >>>>>>>>>>>>> at the end (potentially even slower). How much faster our > > > >>>>> approach > > > >>>>>>> is > > > >>>>>>>>>>>>> highly depends on the buffer sizes and expressions, but > we > > > >>>> found > > > >>>>>> it > > > >>>>>>>>>> to > > > >>>>>>>>>>> be > > > >>>>>>>>>>>>> faster enough on average to justify us extending the > > > >>>> underlying > > > >>>>>>>>>> layout. > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> With that said, this is all within a single thread of > > > >>>> execution. > > > >>>>>>>>>>>>> Parallelization is done by feeding each thread its own > > > >>>>>> vector/data. > > > >>>>>>>>>> As > > > >>>>>>>>>>>>> pointed out in a previous message, this also gives you > the > > > >>>>>>>>>> flexibility > > > >>>>>>>>>>> to > > > >>>>>>>>>>>>> implement cardinality increasing/reducing operations, but > > > >> we > > > >>>>> don’t > > > >>>>>>>>>> use > > > >>>>>>>>>>> it > > > >>>>>>>>>>>>> for that purpose. Operations like filtering, joining, > > > >>>> unnesting > > > >>>>>> and > > > >>>>>>>>>>> similar > > > >>>>>>>>>>>>> are done by wrapping the internal vector in a dictionary, > > > >> as > > > >>>>> these > > > >>>>>>>>>> need > > > >>>>>>>>>>> to > > > >>>>>>>>>>>>> work not only on “ListViews” but on any data types with > > > >> any > > > >>>>>>> encoding. > > > >>>>>>>>>>> There > > > >>>>>>>>>>>>> are more details on Section 4.2.1 in [1] > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> Beyond this, it also gives function/kernel developers > more > > > >>>>>>>>>> flexibility > > > >>>>>>>>>>> to > > > >>>>>>>>>>>>> implement operations that manipulate Arrays/Maps. For > > > >> example, > > > >>>>>>>>>>> operations > > > >>>>>>>>>>>>> that slice these containers can be implemented in a > > > >> zero-copy > > > >>>>>> manner > > > >>>>>>>>>> by > > > >>>>>>>>>>>>> just rearranging the lengths/offsets indices, without > ever > > > >>>>>> touching > > > >>>>>>>>>> the > > > >>>>>>>>>>>>> larger internal buffers. This is a similar motivation as > > > >> for > > > >>>>>>>>>> StringView > > > >>>>>>>>>>>>> (think of substr(), trim(), and similar). One nice last > > > >>>> property > > > >>>>>> is > > > >>>>>>>>>> that > > > >>>>>>>>>>>>> this layout allows for overlapping ranges. This is > > > >> something > > > >>>>>>>>>> discussed > > > >>>>>>>>>>> with > > > >>>>>>>>>>>>> our ML people to allow deduping feature values in a > tensor > > > >>>>> (which > > > >>>>>> is > > > >>>>>>>>>>> fairly > > > >>>>>>>>>>>>> common), but not something we have leveraged just yet. > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> [1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> Best, > > > >>>>>>>>>>>>> -- > > > >>>>>>>>>>>>> Pedro Pedreira > > > >>>>>>>>>>>>> ________________________________ > > > >>>>>>>>>>>>> From: Felipe Oliveira Carvalho <felipe...@gmail.com> > > > >>>>>>>>>>>>> Sent: Friday, May 19, 2023 10:01 AM > > > >>>>>>>>>>>>> To: dev@arrow.apache.org <dev@arrow.apache.org> > > > >>>>>>>>>>>>> Cc: Pedro Eugenio Rocha Pedreira <pedro...@meta.com> > > > >>>>>>>>>>>>> Subject: Re: [DISCUSS][Format] Starting the draft > > > >>>> implementation > > > >>>>>> of > > > >>>>>>>>>> the > > > >>>>>>>>>>>>> ArrayView array format > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> +pedroerp On Thu, 11 May 2023 at 17: 51 Raphael > > > >> Taylor-Davies > > > >>>>> <r. > > > >>>>>>>>>>>>> taylordavies@ googlemail. com. invalid> wrote: Hi All, > > > > >> if > > > >>>> we > > > >>>>>>> added > > > >>>>>>>>>>>>> this, do we think many Arrow and query > engine > > > >>>> implementations > > > >>>>>> (for > > > >>>>>>>>>>>>> example, DataFusion) will be > > > >>>>>>>>>>>>> ZjQcmQRYFpfptBannerStart > > > >>>>>>>>>>>>> This Message Is From an External Sender > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> ZjQcmQRYFpfptBannerEnd > > > >>>>>>>>>>>>> +pedroerp > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies > > > >>>>>>>>>>>>> <r.taylordav...@googlemail.com.invalid> wrote: > > > >>>>>>>>>>>>> Hi All, > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>>> if we added this, do we think many Arrow and query > > > >>>>>>>>>>>>>> engine implementations (for example, DataFusion) will be > > > >>>> eager > > > >>>>> to > > > >>>>>>>>>> add > > > >>>>>>>>>>>>> full > > > >>>>>>>>>>>>>> support for the type, including compute kernels? Or are > > > >> they > > > >>>>>> likely > > > >>>>>>>>>> to > > > >>>>>>>>>>>>> just > > > >>>>>>>>>>>>>> convert this type to ListArray at import boundaries? > > > >>>>>>>>>>>>> I can't speak for query engines in general, but at least > > > >> for > > > >>>>>>> arrow-rs > > > >>>>>>>>>>>>> and by extension DataFusion, and based on my current > > > >>>>> understanding > > > >>>>>>> of > > > >>>>>>>>>>>>> the use-cases I would be rather hesitant to add support > > > >> to the > > > >>>>>>>>>> kernels > > > >>>>>>>>>>>>> for this array type, definitely instead favouring > > > >> conversion > > > >>>> at > > > >>>>>> the > > > >>>>>>>>>>>>> edges. We already have issues with the amount of code > > > >>>> generation > > > >>>>>>>>>>>>> resulting in binary bloat and long compile times, and I > > > >> worry > > > >>>>> this > > > >>>>>>>>>> would > > > >>>>>>>>>>>>> worsen this situation whilst not really providing > > > >> compelling > > > >>>>>>>>>> advantages > > > >>>>>>>>>>>>> for the vast majority of workloads that don't interact > > > >> with > > > >>>>> Velox. > > > >>>>>>>>>>>>> Whilst I can definitely see that the ListView > > > >> representation > > > >>>> is > > > >>>>>>>>>> probably > > > >>>>>>>>>>>>> a better way to represent variable length lists than what > > > >>>> arrow > > > >>>>>>>>>> settled > > > >>>>>>>>>>>>> upon, I'm not yet convinced it is sufficiently better to > > > >>>>>> incentivise > > > >>>>>>>>>>>>> broad ecosystem adoption. > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> Kind Regards, > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> Raphael Taylor-Davies > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>>> On 11/05/2023 21:20, Will Jones wrote: > > > >>>>>>>>>>>>>> Hi Felipe, > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> Thanks for the additional details. > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> Velox kernels benefit from being able to append data to > > > >> the > > > >>>>>> array > > > >>>>>>>>>> from > > > >>>>>>>>>>>>>>> different threads without care for strict ordering. > > > >> Only the > > > >>>>>>>>>> offsets > > > >>>>>>>>>>>>> array > > > >>>>>>>>>>>>>>> has to be written according to logical order but that > is > > > >>>>>>>>>> potentially a > > > >>>>>>>>>>>>> much > > > >>>>>>>>>>>>>>> smaller buffer than the values buffer. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> It still seems to me like applications are still pretty > > > >>>> niche, > > > >>>>>> as I > > > >>>>>>>>>>>>> suspect > > > >>>>>>>>>>>>>> in most cases the benefits are outweighed by the costs. > > > >> The > > > >>>>>> benefit > > > >>>>>>>>>>> here > > > >>>>>>>>>>>>>> seems pretty limited: if you are trying to split work > > > >> between > > > >>>>>>>>>> threads, > > > >>>>>>>>>>>>>> usually you will have other levels such as array chunks > > > >> to > > > >>>>>>>>>> parallelize. > > > >>>>>>>>>>>>> And > > > >>>>>>>>>>>>>> if you have an incoming stream of row data, you'll want > > > >> to > > > >>>>> append > > > >>>>>>> in > > > >>>>>>>>>>>>>> predictable order to match the order of the other > > > >> arrays. Am > > > >>>> I > > > >>>>>>>>>> missing > > > >>>>>>>>>>>>>> something? > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> And, IIUC, the cost of using ListView with out-of-order > > > >>>> values > > > >>>>>> over > > > >>>>>>>>>>>>>> ListArray is you lose memory locality; the values of > > > >> element > > > >>>> 2 > > > >>>>>> are > > > >>>>>>>>>> no > > > >>>>>>>>>>>>>> longer adjacent to the values of element 1. What do you > > > >> think > > > >>>>>> about > > > >>>>>>>>>>> that > > > >>>>>>>>>>>>>> tradeoff? > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> I don't mean to be difficult about this. I'm excited for > > > >> both > > > >>>>> the > > > >>>>>>>>>> REE > > > >>>>>>>>>>> and > > > >>>>>>>>>>>>>> StringView arrays, but this one I'm not so sure about > > > >> yet. I > > > >>>>>>> suppose > > > >>>>>>>>>>>>> what I > > > >>>>>>>>>>>>>> am trying to ask is, if we added this, do we think many > > > >> Arrow > > > >>>>> and > > > >>>>>>>>>> query > > > >>>>>>>>>>>>>> engine implementations (for example, DataFusion) will be > > > >>>> eager > > > >>>>> to > > > >>>>>>>>>> add > > > >>>>>>>>>>>>> full > > > >>>>>>>>>>>>>> support for the type, including compute kernels? Or are > > > >> they > > > >>>>>> likely > > > >>>>>>>>>> to > > > >>>>>>>>>>>>> just > > > >>>>>>>>>>>>>> convert this type to ListArray at import boundaries? > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> Because if it turns out to be the latter, then we might > > > >> as > > > >>>> well > > > >>>>>> ask > > > >>>>>>>>>>> Velox > > > >>>>>>>>>>>>>> to export this type as ListArray and save the rest of > the > > > >>>>>> ecosystem > > > >>>>>>>>>>> some > > > >>>>>>>>>>>>>> work. > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> Best, > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> Will Jones > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira > > > >> Carvalho < > > > >>>>>>>>>>>>>> felipe...@gmail.com<mailto:felipe...@gmail.com>> wrote: > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> Initial reason for ListView arrays in Arrow is > zero-copy > > > >>>>>>>>>> compatibility > > > >>>>>>>>>>>>> with > > > >>>>>>>>>>>>>>> Velox which uses this format. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> Velox kernels benefit from being able to append data to > > > >> the > > > >>>>>> array > > > >>>>>>>>>> from > > > >>>>>>>>>>>>>>> different threads without care for strict ordering. > > > >> Only the > > > >>>>>>>>>> offsets > > > >>>>>>>>>>>>> array > > > >>>>>>>>>>>>>>> has to be written according to logical order but that > is > > > >>>>>>>>>> potentially a > > > >>>>>>>>>>>>> much > > > >>>>>>>>>>>>>>> smaller buffer than the values buffer. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> Acero kernels could take advantage of that in the > > > >> future. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> In implementing ListViewArray/Type I was able to reuse > > > >> some > > > >>>>> C++ > > > >>>>>>>>>>>>> templates > > > >>>>>>>>>>>>>>> used for ListArray which can reduce some of the burden > > > >> on > > > >>>>> kernel > > > >>>>>>>>>>>>>>> implementations that aim to work with all the types. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> I’m can fix Acero kernels for working with ListView. > > > >> This is > > > >>>>>>>>>> similar > > > >>>>>>>>>>> to > > > >>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>> work I’ve doing in kernels dealing with run-end encoded > > > >>>>> arrays. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> — > > > >>>>>>>>>>>>>>> Felipe > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> On Wed, 26 Apr 2023 at 01:03 Will Jones < > > > >>>>>> will.jones...@gmail.com > > > >>>>>>>>>>>>> <mailto:will.jones...@gmail.com>> wrote: > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> I suppose one common use case is materializing list > > > >> columns > > > >>>>>> after > > > >>>>>>>>>>> some > > > >>>>>>>>>>>>>>>> expanding operation like a join or unnest. That's a > > > >> case > > > >>>>> where > > > >>>>>> I > > > >>>>>>>>>>> could > > > >>>>>>>>>>>>>>>> imagine a lot of repetition of values. Haven't yet > > > >> thought > > > >>>> of > > > >>>>>>>>>> common > > > >>>>>>>>>>>>>>> cases > > > >>>>>>>>>>>>>>>> where there is overlap but not full duplication, but > am > > > >>>> eager > > > >>>>>> to > > > >>>>>>>>>> hear > > > >>>>>>>>>>>>>>> any. > > > >>>>>>>>>>>>>>>> The dictionary encoding point Raphael makes is > > > >> interesting, > > > >>>>>>>>>>> especially > > > >>>>>>>>>>>>>>>> given the existence of LargeList and FixedSizeList. > For > > > >>>> many > > > >>>>>>>>>>>>> operations, > > > >>>>>>>>>>>>>>> it > > > >>>>>>>>>>>>>>>> might make more sense to just compose those existing > > > >> types. > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> IIUC the operations that would be unique to the > > > >> ArrayView > > > >>>> are > > > >>>>>>> ones > > > >>>>>>>>>>>>>>> altering > > > >>>>>>>>>>>>>>>> the shape. One could truncate each array to a certain > > > >>>> length > > > >>>>>>>>>> cheaply > > > >>>>>>>>>>>>>>> simply > > > >>>>>>>>>>>>>>>> by replacing the sizes buffer. Or perhaps there are > > > >>>>> interesting > > > >>>>>>>>>>>>>>> operations > > > >>>>>>>>>>>>>>>> on tensors that would benefit. > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 7:47 PM Raphael Taylor-Davies > > > >>>>>>>>>>>>>>>> <r.taylordav...@googlemail.com.invalid> wrote: > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> Unless I am missing something, I think the selection > > > >>>>> use-case > > > >>>>>>>>>> could > > > >>>>>>>>>>> be > > > >>>>>>>>>>>>>>>>> equally well served by a dictionary-encoded > > > >>>>>>> BinarArray/ListArray, > > > >>>>>>>>>>> and > > > >>>>>>>>>>>>>>>> would > > > >>>>>>>>>>>>>>>>> have the benefit of not requiring any modifications > > > >> to the > > > >>>>>>>>>> existing > > > >>>>>>>>>>>>>>>> format > > > >>>>>>>>>>>>>>>>> or kernels. > > > >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> The major additional flexibility of the proposed > > > >> encoding > > > >>>>>> would > > > >>>>>>>>>> be > > > >>>>>>>>>>>>>>>>> permitting disjoint or overlapping ranges, are these > > > >>>> common > > > >>>>>>>>>> enough > > > >>>>>>>>>>> in > > > >>>>>>>>>>>>>>>>> practice to represent a meaningful bottleneck? > > > >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> On 26 April 2023 01:40:14 BST, David Li < > > > >>>>> lidav...@apache.org > > > >>>>>>>>>>> <mailto: > > > >>>>>>>>>>>>> lidav...@apache.org>> wrote: > > > >>>>>>>>>>>>>>>>>> Is there a need for a 64-bit offsets version the > > > >> same way > > > >>>>> we > > > >>>>>>>>>> have > > > >>>>>>>>>>>>> List > > > >>>>>>>>>>>>>>>>> and LargeList? > > > >>>>>>>>>>>>>>>>>> And just to be clear, the difference with List is > > > >> that > > > >>>> the > > > >>>>>>> lists > > > >>>>>>>>>>>>> don't > > > >>>>>>>>>>>>>>>>> have to be stored in their logical order (or in other > > > >>>> words, > > > >>>>>>>>>> offsets > > > >>>>>>>>>>>>> do > > > >>>>>>>>>>>>>>>> not > > > >>>>>>>>>>>>>>>>> have to be nondecreasing and so we also need sizes)? > > > >>>>>>>>>>>>>>>>>> On Wed, Apr 26, 2023, at 09:37, Weston Pace wrote: > > > >>>>>>>>>>>>>>>>>>> For context, there was some discussion on this back > > > >> in > > > >>>>> [1]. > > > >>>>>> At > > > >>>>>>>>>>> that > > > >>>>>>>>>>>>>>>>> time > > > >>>>>>>>>>>>>>>>>>> this was called "sequence view" but I do not like > > > >> that > > > >>>>> name. > > > >>>>>>>>>>>>>>> However, > > > >>>>>>>>>>>>>>>>>>> array-view array is a little confusing. Given this > > > >> is > > > >>>>>> similar > > > >>>>>>>>>> to > > > >>>>>>>>>>>>>>> list > > > >>>>>>>>>>>>>>>>> can > > > >>>>>>>>>>>>>>>>>>> we go with list-view array? > > > >>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>> Thanks for the introduction. I'd be interested to > > > >> hear > > > >>>>>> about > > > >>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>> applications Velox has found for these vectors, > > > >> and in > > > >>>>> what > > > >>>>>>>>>>>>>>>> situations > > > >>>>>>>>>>>>>>>>>>> they > > > >>>>>>>>>>>>>>>>>>>> are useful. This could be contrasted with the > > > >> current > > > >>>>>>>>>> ListArray > > > >>>>>>>>>>>>>>>>>>>> implementations. > > > >>>>>>>>>>>>>>>>>>> I believe one significant benefit is that take (and > > > >> by > > > >>>>>> proxy, > > > >>>>>>>>>>>>>>> filter) > > > >>>>>>>>>>>>>>>>> and > > > >>>>>>>>>>>>>>>>>>> sort are O(# of items) with the proposed format and > > > >> O(# > > > >>>> of > > > >>>>>>>>>> bytes) > > > >>>>>>>>>>>>>>> with > > > >>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>> current format. Jorge did some profiling to this > > > >> effect > > > >>>> in > > > >>>>>>> [1]. > > > >>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>> [1] > > > >>>>>>>>>>>>>>> > > > >>>>>> > https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq< > > > >>>>>>>>>>>>> > > > >>>>> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq > > > > > >>>>>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 3:13 PM Will Jones < > > > >>>>>>>>>>> will.jones...@gmail.com > > > >>>>>>>>>>>>> <mailto:will.jones...@gmail.com> > > > >>>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>>>>> Hi Felipe, > > > >>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>> Thanks for the introduction. I'd be interested to > > > >> hear > > > >>>>>> about > > > >>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>> applications Velox has found for these vectors, > > > >> and in > > > >>>>> what > > > >>>>>>>>>>>>>>>> situations > > > >>>>>>>>>>>>>>>>> they > > > >>>>>>>>>>>>>>>>>>>> are useful. This could be contrasted with the > > > >> current > > > >>>>>>>>>> ListArray > > > >>>>>>>>>>>>>>>>>>>> implementations. > > > >>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>> IIUC it would be fairly cheap to transform a > > > >> ListArray > > > >>>> to > > > >>>>>> an > > > >>>>>>>>>>>>>>>>> ArrayView, but > > > >>>>>>>>>>>>>>>>>>>> expensive to go the other way. > > > >>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>> Best, > > > >>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>> Will Jones > > > >>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira > > > >>>> Carvalho > > > >>>>> < > > > >>>>>>>>>>>>>>>>>>>> felipe...@gmail.com<mailto:felipe...@gmail.com>> > > > >>>> wrote: > > > >>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> Hi folks, > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> I would like to start a public discussion on the > > > >>>>> inclusion > > > >>>>>>>>>> of a > > > >>>>>>>>>>>>>>> new > > > >>>>>>>>>>>>>>>>> array > > > >>>>>>>>>>>>>>>>>>>>> format to Arrow — array-view array. The name is > > > >> also > > > >>>> up > > > >>>>>> for > > > >>>>>>>>>>>>>>> debate. > > > >>>>>>>>>>>>>>>>>>>>> This format is inspired by Velox's ArrayVector > > > >> format > > > >>>>> [1]. > > > >>>>>>>>>>>>>>>> Logically, > > > >>>>>>>>>>>>>>>>>>>> this > > > >>>>>>>>>>>>>>>>>>>>> array represents an array of arrays. Each element > > > >> is > > > >>>> an > > > >>>>>>>>>>>>>>> array-view > > > >>>>>>>>>>>>>>>>>>>> (offset > > > >>>>>>>>>>>>>>>>>>>>> and size pair) that points to a range within a > > > >> nested > > > >>>>>>>>>> "values" > > > >>>>>>>>>>>>>>>> array > > > >>>>>>>>>>>>>>>>>>>>> (called "elements" in Velox docs). The nested > > > >> array > > > >>>> can > > > >>>>> be > > > >>>>>>> of > > > >>>>>>>>>>> any > > > >>>>>>>>>>>>>>>>> type, > > > >>>>>>>>>>>>>>>>>>>>> which makes this format very flexible and > > > >> powerful. > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> [image: ../_images/array-vector.png] > > > >>>>>>>>>>>>>>>>>>>>> < > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>> > > > >>>> > https://facebookincubator.github.io/velox/_images/array-vector.png > > > >>>>> < > > > >>>>>>>>>>>>> > > > >>>>>> > > https://facebookincubator.github.io/velox/_images/array-vector.png > > > >>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> I'm currently working on a C++ implementation and > > > >> plan > > > >>>>> to > > > >>>>>>>>>> work > > > >>>>>>>>>>>>>>> on a > > > >>>>>>>>>>>>>>>>> Go > > > >>>>>>>>>>>>>>>>>>>>> implementation to fulfill the two-implementations > > > >>>>>>> requirement > > > >>>>>>>>>>> for > > > >>>>>>>>>>>>>>>>> format > > > >>>>>>>>>>>>>>>>>>>>> changes. > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> The draft design: > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> - 3 buffers: [validity_bitmap, int32 offsets > > > >> buffer, > > > >>>>> int32 > > > >>>>>>>>>> sizes > > > >>>>>>>>>>>>>>>>> buffer] > > > >>>>>>>>>>>>>>>>>>>>> - 1 child array: "values" as an array of the type > > > >>>>>> parameter > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> validity_bitmap is used to differentiate between > > > >> empty > > > >>>>>> array > > > >>>>>>>>>>>>>>> views > > > >>>>>>>>>>>>>>>>>>>>> (sizes[i] == 0) and NULL array views > > > >>>> (validity_bitmap[i] > > > >>>>>> == > > > >>>>>>>>>> 0). > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> When the validity_bitmap[i] is 0, both sizes and > > > >>>> offsets > > > >>>>>> are > > > >>>>>>>>>>>>>>>>> undefined > > > >>>>>>>>>>>>>>>>>>>> (as > > > >>>>>>>>>>>>>>>>>>>>> usual), and when sizes[i] == 0, offsets[i] is > > > >>>>> undefined. 0 > > > >>>>>>> is > > > >>>>>>>>>>>>>>>>> recommended > > > >>>>>>>>>>>>>>>>>>>>> if setting a value is not an issue to the system > > > >>>>> producing > > > >>>>>>>>>> the > > > >>>>>>>>>>>>>>>>> arrays. > > > >>>>>>>>>>>>>>>>>>>>> offsets buffer is not required to be ordered and > > > >> views > > > >>>>>> don't > > > >>>>>>>>>>> have > > > >>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>> be > > > >>>>>>>>>>>>>>>>>>>>> disjoint. > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> [1] > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>>>> > > > >>>> > > > >> > > > > > > https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector > > > >>>>>>>>>>>>> < > > > >>>>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>>>> > > > >>>> > > > >> > > > > > > https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>>> Thanks, > > > >>>>>>>>>>>>>>>>>>>>> Felipe O. Carvalho > > > >>>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>>>> > > > >>>> > > > >> > > > > > > > > > > > > >