> I don't think "we have to adjust the Arrow format so that existing
> internal representations become Arrow-compliant without any
> (re-)implementation effort" is a reasonable design principle.
I agree with this statement from Antoine -- given the Arrow community has
standardized an addition to the format with StringView, I think it would
help to get some input from those at DuckDB and Velox on their perspective.

Andrew

On Mon, Oct 2, 2023 at 9:17 AM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

> Oh I'm with you on it being a precedent we want to be very careful about
> setting, but if there isn't a meaningful performance difference, we may
> be able to sidestep that discussion entirely.
>
> On 02/10/2023 14:11, Antoine Pitrou wrote:
> >
> > Even if performance were significantly better, I don't think it's a good
> > enough reason to add these representations to Arrow. By construction,
> > a standard cannot continuously chase the performance state of the art; it
> > has to weigh the benefits of performance improvements against the
> > increased cost for the ecosystem (for example the cost of adapting to
> > frequent standard changes and a growing standard size).
> >
> > We have extension types which could reasonably be used for
> > non-standard data types, especially the kind that are motivated by
> > leading-edge performance research and innovation and come with unusual
> > constraints (such as requiring trusting and dereferencing raw pointers
> > embedded in data buffers). There could even be an argument for making
> > some of them canonical extension types if there's enough precedent
> > in favor.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 02/10/2023 at 15:00, Raphael Taylor-Davies wrote:
> >> I think what would really help would be some concrete numbers, do we
> >> have any numbers comparing the performance of the offset and pointer
> >> based representations? If there isn't a significant performance
> >> difference between them, would the systems that currently use a
> >> pointer-based approach be willing to meet us in the middle and switch to
> >> an offset based encoding?
> >> This to me feels like it would be the best
> >> outcome for the ecosystem as a whole.
> >>
> >> Kind Regards,
> >>
> >> Raphael
> >>
> >> On 02/10/2023 13:50, Antoine Pitrou wrote:
> >>>
> >>> On 01/10/2023 at 16:21, Micah Kornfield wrote:
> >>>>>
> >>>>> I would also assert that another way to reduce this risk is to add
> >>>>> some prose to the relevant sections of the columnar format
> >>>>> specification doc to clearly explain that a raw pointers variant of
> >>>>> the layout, while not part of the official spec, may be
> >>>>> implemented in
> >>>>> some Arrow libraries.
> >>>>
> >>>> I've lost a little context but on all the concerns of adding raw
> >>>> pointers
> >>>> as an official option to the spec. But I see making raw-pointer
> >>>> variants
> >>>> the best path forward.
> >>>>
> >>>> Things captured from this thread, or that seem obvious at least to me:
> >>>> 1. Divergence of IPC spec from in-memory/C-ABI spec?
> >>>> 2. More parts of the spec to cover.
> >>>> 3. Incompatibility with some languages
> >>>> 4. Validation (in my mind different use-cases require different
> >>>> levels of
> >>>> validation, so this is a little bit less of a concern in my mind).
> >>>>
> >>>> I think the broader issue is how we think about compatibility with
> >>>> other
> >>>> systems. For instance, what happens if Velox and DuckDB start adding
> >>>> new
> >>>> divergent memory layouts? Are we expecting to add them to the spec?
> >>>
> >>> This is a slippery slope. The more Arrow has a policy of integrating
> >>> existing practices simply because they exist, the more the Arrow
> >>> format will become _à la carte_, with different implementations
> >>> choosing to implement whatever they want to spend their engineering
> >>> effort on (you can see this occur, in part, on the Parquet format with
> >>> its many different encodings, compression algorithms and a 96-bit
> >>> timestamp type).
> >>>
> >>> We _have_ to think carefully about the middle- and long-term future of
> >>> the format when adopting new features.
> >>>
> >>> In this instance, we are doing a large part of the effort by adopting
> >>> a string view format with variadic buffers, inlined prefixes and
> >>> offset-based views into those buffers. But some implementations with
> >>> historically different internal representations will have to share
> >>> part of the effort to align with the newly standardized format.
> >>>
> >>> I don't think "we have to adjust the Arrow format so that existing
> >>> internal representations become Arrow-compliant without any
> >>> (re-)implementation effort" is a reasonable design principle.
> >>>
> >>> Regards
> >>>
> >>> Antoine.