Given the discussion on this thread, I think the best thing we could do is:

1. Do not change the Arrow spec / C++ implementation (do not add raw pointers).
2. Abandon the goal of "truly zero copy" interchange with Velox and DuckDB as unobtainable.
3. Focus our efforts as a community on driving the new additions to Arrow through its various ecosystems.
My rationale is that converting DuckDB/Velox-style strings to the new Arrow string view via pointer swizzling is almost certainly less costly than also copying all the string data, as required with the original String Array. So while the conversion is not "zero copy", it would be "zero string data copy", which might be good enough, and is certainly better than forcing most users to convert to the original String Array, as required until the new String View is implemented across the ecosystem.

I believe broad support for the newly updated Arrow String View offers a far better performance/effort tradeoff (as a community) than optimizing away the last bits of conversion, though for certain use cases the tradeoff may be different.

Andrew

On Fri, Oct 6, 2023 at 12:35 PM Neal Richardson <neal.p.richard...@gmail.com> wrote:

> Agreed, it's unfortunately not just a simple tradeoff. We have discussed this a bit in [1] and in several other threads around this topic. If we say that Arrow is about interchange and not execution, and so we shouldn't adopt the pointer version that DuckDB uses, that means we're also making interchange harder because of the need to convert from your internal format to the Arrow format at the boundary. Adding the pointer version to the Arrow format solves that, but creates costs elsewhere.
>
> IIUC Ben's proposal tried to solve this tension by making it possible for two systems to agree to use the pointer version and pass data without serialization costs. That comes with its own risks and tradeoffs.
>
> This feels like another case where the "canonical alternative layout" discussed in [1] could be a way to formalize this variation and allow it to be used but not required in all implementations. One way or another, we need to find a way to balance the desire for Arrow to be a universal standard with the risk of diluting the standard to accommodate every project.
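The "pointer swizzling" conversion Andrew describes above could be sketched as follows. This is an editorial illustration, not code from Arrow or DuckDB; the struct and function names are hypothetical, and it assumes the simplest case where every raw pointer targets a single known buffer:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical pointer-based view in the spirit of a DuckDB/Velox string
// vector entry (illustrative only, not the engines' real structs).
struct PointerView {
    uint32_t length;
    const char* data;  // raw pointer to the string bytes
};

// Hypothetical offset-based view: the string lives at `offset` within
// data buffer number `buffer_index`, as in the Arrow StringView layout.
struct OffsetView {
    uint32_t length;
    uint32_t buffer_index;
    uint32_t offset;
};

// "Swizzle" raw pointers into (buffer, offset) pairs. Only the small view
// structs are rewritten; the string bytes themselves are never touched or
// copied -- the "zero string data copy" property described above. This
// sketch assumes every pointer falls inside the one buffer starting at
// `buffer_base`; a real converter would have to locate the owning buffer
// for each pointer.
std::vector<OffsetView> Swizzle(const std::vector<PointerView>& views,
                                const char* buffer_base,
                                uint32_t buffer_index) {
    std::vector<OffsetView> out;
    out.reserve(views.size());
    for (const PointerView& v : views) {
        out.push_back({v.length, buffer_index,
                       static_cast<uint32_t>(v.data - buffer_base)});
    }
    return out;
}
```

The cost is linear in the number of views (a subtraction per string), not in the total string bytes, which is why it can be much cheaper than a conversion to the original offset-based String Array.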
> Neal
>
> [1]: https://lists.apache.org/thread/djl9dbd7qmozxtjpfzby40gg23x0o3wo
>
> On Fri, Oct 6, 2023 at 11:47 AM Weston Pace <weston.p...@gmail.com> wrote:
>
> > > I feel the broader question here is what is Arrow's intended use case -
> > > interchange or execution
> >
> > The line between interchange and execution is not always clear. For example, I think we would like Arrow to be considered as a standard for UDF libraries.
> >
> > On Fri, Oct 6, 2023 at 7:34 AM Mark Raasveldt <m...@duckdblabs.com> wrote:
> >
> > > For the index vs pointer question - DuckDB went with pointers as they are more flexible, and DuckDB was designed to consume data (and strings) from a wide variety of formats in a wide variety of languages. Pointers allow us to easily zero-copy from e.g. Python strings, R strings, Arrow strings, etc. The flip side of pointers is that ownership has to be handled very carefully. Our vector format is an execution-only format and never leaves the internals of the engine. This greatly simplifies ownership, as we are in complete control of what happens inside the engine. For an interchange format that is intended for handing data between engines, I can see this being more complicated and verification being more important.
> > >
> > > As for the actual change:
> > >
> > > From an interchange perspective from DuckDB's side - the proposed zero-copy integration would definitely speed up the conversion of DuckDB string vectors to Arrow string vectors. In a recent benchmark that we performed, we found string conversion to Arrow vectors to be a bottleneck in certain workloads, although we have not sufficiently researched whether this could be improved in other ways. It is possible this can be alleviated without requiring changes to Arrow.
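Mark's point above, that raw pointers make zero-copy ingestion from many host formats trivial but push all ownership tracking onto the engine, can be illustrated with a minimal sketch. The struct and helpers are hypothetical, not DuckDB's actual `string_t`:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical pointer-based entry in the spirit of DuckDB's string
// vectors (illustrative; not DuckDB's real string_t definition).
struct PtrString {
    uint32_t length;
    const char* data;
};

// Because the entry holds a raw pointer, it can reference string bytes
// wherever they already live -- a C literal, a std::string owned by a
// host-language binding, an Arrow data buffer -- with no copying at all.
// The flip side Mark describes: nothing in the entry records who owns
// that memory or how long it stays valid, so the engine must track
// lifetimes out of band.
PtrString ViewOf(const char* data) {
    return PtrString{static_cast<uint32_t>(std::strlen(data)), data};
}

PtrString ViewOf(const std::string& s) {
    return PtrString{static_cast<uint32_t>(s.size()), s.data()};
}
```

An offset-based interchange format avoids the dangling-pointer hazard, at the price of having to register (or copy into) a known set of buffers at the boundary.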
> > > However - in general, a new string vector format is only useful if consumers also support the format. If the consumer immediately converts the strings back into the standard Arrow string representation, then there is no benefit. The change will only move where the conversion happens (from inside DuckDB to inside the consumer). As such, this change is only useful if the broader Arrow ecosystem moves towards supporting the new string format.
> > >
> > > From an execution perspective from DuckDB's side - it is unlikely that we will switch to using Arrow as an internal format at this stage of the project. While this change increases Arrow's utility as an intermediate execution format, that is more relevant to projects that are currently using Arrow in this manner or are planning to do so.
> > >
> > > I feel the broader question here is what is Arrow's intended use case - interchange or execution - as they are opposed in this discussion. This change improves Arrow's utility as an execution format at the expense of stability in the interchange format. From my perspective, Arrow is more useful as an interchange format. When different tools communicate with each other, a standard is required. An execution format is generally not exposed outside of the internals of the execution engine. Engines can do whatever they want here - and a standard is perhaps not as useful.
> > >
> > > On 2023/10/02 13:21:59 Andrew Lamb wrote:
> > > > > I don't think "we have to adjust the Arrow format so that existing
> > > > > internal representations become Arrow-compliant without any
> > > > > (re-)implementation effort" is a reasonable design principle.
> > > > I agree with this statement from Antoine -- given the Arrow community has standardized an addition to the format with StringView, I think it would help to get some input from those at DuckDB and Velox on their perspective.
> > > >
> > > > Andrew
> > > >
> > > > On Mon, Oct 2, 2023 at 9:17 AM Raphael Taylor-Davies <r....@googlemail.com.invalid> wrote:
> > > >
> > > > > Oh I'm with you on it being a precedent we want to be very careful about setting, but if there isn't a meaningful performance difference, we may be able to sidestep that discussion entirely.
> > > > >
> > > > > On 02/10/2023 14:11, Antoine Pitrou wrote:
> > > > > >
> > > > > > Even if performance were significantly better, I don't think it's a good enough reason to add these representations to Arrow. By construction, a standard cannot continuously chase the performance state of the art; it has to weigh the benefits of performance improvements against the increased cost for the ecosystem (for example, the cost of adapting to frequent standard changes and a growing standard size).
> > > > > >
> > > > > > We have extension types which could reasonably be used for non-standard data types, especially the kind that are motivated by leading-edge performance research and innovation and come with unusual constraints (such as requiring trusting and dereferencing raw pointers embedded in data buffers). There could even be an argument for making some of them canonical extension types if there's enough anteriority in favor.
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > > > Le 02/10/2023 à 15:00, Raphael Taylor-Davies a écrit :
> > > > > >> I think what would really help would be some concrete numbers: do we have any numbers comparing the performance of the offset- and pointer-based representations? If there isn't a significant performance difference between them, would the systems that currently use a pointer-based approach be willing to meet us in the middle and switch to an offset-based encoding? This to me feels like it would be the best outcome for the ecosystem as a whole.
> > > > > >>
> > > > > >> Kind Regards,
> > > > > >>
> > > > > >> Raphael
> > > > > >>
> > > > > >> On 02/10/2023 13:50, Antoine Pitrou wrote:
> > > > > >>>
> > > > > >>> Le 01/10/2023 à 16:21, Micah Kornfield a écrit :
> > > > > >>>>>
> > > > > >>>>> I would also assert that another way to reduce this risk is to add some prose to the relevant sections of the columnar format specification doc to clearly explain that a raw-pointers variant of the layout, while not part of the official spec, may be implemented in some Arrow libraries.
> > > > > >>>>
> > > > > >>>> I've lost a little context on all the concerns of adding raw pointers as an official option to the spec. But I see making raw-pointer variants the best path forward.
> > > > > >>>>
> > > > > >>>> Things captured from this thread, or that seem obvious at least to me:
> > > > > >>>> 1. Divergence of IPC spec from in-memory/C-ABI spec.
> > > > > >>>> 2. More parts of the spec to cover.
> > > > > >>>> 3. Incompatibility with some languages.
> > > > > >>>> 4. Validation (in my mind, different use cases require different levels of validation, so this is a little bit less of a concern).
> > > > > >>>>
> > > > > >>>> I think the broader issue is how we think about compatibility with other systems. For instance, what happens if Velox and DuckDB start adding new divergent memory layouts? Are we expecting to add them to the spec?
> > > > > >>>
> > > > > >>> This is a slippery slope. The more Arrow has a policy of integrating existing practices simply because they exist, the more the Arrow format will become _à la carte_, with different implementations choosing to implement whatever they want to spend their engineering effort on (you can see this occur, in part, in the Parquet format with its many different encodings, compression algorithms, and a 96-bit timestamp type).
> > > > > >>>
> > > > > >>> We _have_ to think carefully about the middle- and long-term future of the format when adopting new features.
> > > > > >>>
> > > > > >>> In this instance, we are doing a large part of the effort by adopting a string view format with variadic buffers, inlined prefixes, and offset-based views into those buffers. But some implementations with historically different internal representations will have to share part of the effort to align with the newly standardized format.
> > > > > >>>
> > > > > >>> I don't think "we have to adjust the Arrow format so that existing internal representations become Arrow-compliant without any (re-)implementation effort" is a reasonable design principle.
> > > > > >>>
> > > > > >>> Regards
> > > > > >>>
> > > > > >>> Antoine.
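For readers following the thread, the string view format Antoine refers to (variadic buffers, inlined prefixes, offset-based views) centers on a fixed 16-byte view struct: strings of at most 12 bytes are stored entirely inline, while longer strings keep a 4-byte prefix inline plus a (buffer index, offset) pair into one of the array's variadic data buffers. A minimal editorial sketch of that layout (field names here are illustrative, not the Arrow C++ implementation's):

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Sketch of the fixed 16-byte view struct in the Arrow StringView layout.
struct StringView {
    uint32_t length;
    union {
        char inlined[12];  // length <= 12: the whole string, zero-padded
        struct Ref {
            char prefix[4];         // first 4 bytes, for fast comparisons
            uint32_t buffer_index;  // which variadic data buffer
            uint32_t offset;        // byte offset within that buffer
        } ref;
    } u;
};
static_assert(sizeof(StringView) == 16, "views are fixed 16-byte structs");

StringView MakeView(const std::string& s, uint32_t buffer_index,
                    uint32_t offset) {
    StringView v{};
    v.length = static_cast<uint32_t>(s.size());
    if (s.size() <= 12) {
        std::memcpy(v.u.inlined, s.data(), s.size());
    } else {
        std::memcpy(v.u.ref.prefix, s.data(), 4);
        v.u.ref.buffer_index = buffer_index;
        v.u.ref.offset = offset;
    }
    return v;
}
```

The inline prefix is what lets comparisons and equality checks short-circuit without chasing the buffer reference; the (buffer index, offset) pair, rather than a raw pointer, is the point of contention in this thread.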