Re: [DISCUSS][C++] Raw pointer string views

Andrew Lamb Mon, 02 Oct 2023 06:23:50 -0700

> I don't think "we have to adjust the Arrow format so that existing
> internal representations become Arrow-compliant without any
> (re-)implementation effort" is a reasonable design principle.


I agree with this statement from Antoine. Given that the Arrow community
has standardized an addition to the format with StringView, I think it would
help to get some input from those at DuckDB and Velox on their perspective.

Andrew




On Mon, Oct 2, 2023 at 9:17 AM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

> Oh I'm with you on it being a precedent we want to be very careful about
> setting, but if there isn't a meaningful performance difference, we may
> be able to sidestep that discussion entirely.
>
> On 02/10/2023 14:11, Antoine Pitrou wrote:
> >
> > Even if performance were significantly better, I don't think it's a good
> > enough reason to add these representations to Arrow. By construction,
> > a standard cannot continuously chase the performance state of the art; it
> > has to weigh the benefits of performance improvements against the
> > increased cost for the ecosystem (for example the cost of adapting to
> > frequent standard changes and a growing standard size).
> >
> > We have extension types which could reasonably be used for
> > non-standard data types, especially the kind that are motivated by
> > leading-edge performance research and innovation and come with unusual
> > constraints (such as requiring trusting and dereferencing raw pointers
> > embedded in data buffers). There could even be an argument for making
> > some of them canonical extension types if there's enough prior art
> > in favor.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 02/10/2023 à 15:00, Raphael Taylor-Davies a écrit :
> >> I think what would really help would be some concrete numbers. Do we
> >> have any numbers comparing the performance of the offset-based and
> >> pointer-based representations? If there isn't a significant performance
> >> difference between them, would the systems that currently use a
> >> pointer-based approach be willing to meet us in the middle and switch to
> >> an offset-based encoding? This to me feels like it would be the best
> >> outcome for the ecosystem as a whole.
> >>
> >> Kind Regards,
> >>
> >> Raphael
> >>
> >> On 02/10/2023 13:50, Antoine Pitrou wrote:
> >>>
> >>> Le 01/10/2023 à 16:21, Micah Kornfield a écrit :
> >>>>>
> >>>>> I would also assert that another way to reduce this risk is to add
> >>>>> some prose to the relevant sections of the columnar format
> >>>>> specification doc to clearly explain that a raw pointers variant of
> >>>>> the layout, while not part of the official spec, may be implemented
> >>>>> in some Arrow libraries.
> >>>>
> >>>> I've lost a little context on all the concerns about adding raw
> >>>> pointers as an official option to the spec, but I see making
> >>>> raw-pointer variants the best path forward.
> >>>>
> >>>> Things captured from this thread, or that seem obvious at least to me:
> >>>> 1.  Divergence of IPC spec from in-memory/C-ABI spec?
> >>>> 2.  More parts of the spec to cover.
> >>>> 3.  Incompatibility with some languages.
> >>>> 4.  Validation (in my mind different use-cases require different
> >>>> levels of validation, so this is a little bit less of a concern).
> >>>>
> >>>> I think the broader issue is how we think about compatibility with
> >>>> other systems.  For instance, what happens if Velox and DuckDB start
> >>>> adding new divergent memory layouts?  Are we expecting to add them to
> >>>> the spec?
> >>>
> >>> This is a slippery slope. The more Arrow has a policy of integrating
> >>> existing practices simply because they exist, the more the Arrow
> >>> format will become _à la carte_, with different implementations
> >>> choosing to implement whatever they want to spend their engineering
> >>> effort on (you can see this occur, in part, in the Parquet format with
> >>> its many different encodings, compression algorithms and a 96-bit
> >>> timestamp type).
> >>>
> >>> We _have_ to think carefully about the middle- and long-term future of
> >>> the format when adopting new features.
> >>>
> >>> In this instance, we are doing a large part of the effort by adopting
> >>> a string view format with variadic buffers, inlined prefixes and
> >>> offset-based views into those buffers. But some implementations with
> >>> historically different internal representations will have to share
> >>> part of the effort to align with the newly standardized format.
> >>>
> >>> I don't think "we have to adjust the Arrow format so that existing
> >>> internal representations become Arrow-compliant without any
> >>> (re-)implementation effort" is a reasonable design principle.
> >>>
> >>> Regards
> >>>
> >>> Antoine.
>
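
For readers following the thread, the two layouts under discussion can be
sketched roughly as below. This is a minimal illustration, not the Arrow C++
API: the names OffsetView, RawPointerView, AppendString, Resolve,
ToRawPointer and ResolveRaw are hypothetical, the data buffers are modelled
as std::string for brevity, and the raw-pointer variant assumes 64-bit
pointers. Both views are 16 bytes and inline strings of up to 12 bytes; they
differ only in how a longer string is located.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

constexpr std::int32_t kInlineLimit = 12;  // strings up to 12 bytes are inlined

// Offset-based view (the standardized StringView layout, as a sketch):
// payload = inline bytes, or prefix[4] + buffer_index[4] + offset[4].
struct OffsetView {
  std::int32_t length;
  char payload[12];
};
static_assert(sizeof(OffsetView) == 16, "view must be 16 bytes");

// Raw-pointer view (the variant debated in the thread, as a sketch):
// payload = inline bytes, or prefix[4] + pointer[8]. Assumes 64-bit pointers.
struct RawPointerView {
  std::int32_t length;
  char payload[12];
};
static_assert(sizeof(const char*) == 8, "sketch assumes 64-bit pointers");

// Builds a view for `s`, appending to buffers[buffer_index] if it is too
// long to inline.
OffsetView AppendString(std::vector<std::string>& buffers,
                        std::int32_t buffer_index, std::string_view s) {
  OffsetView v{};
  v.length = static_cast<std::int32_t>(s.size());
  if (v.length <= kInlineLimit) {
    if (!s.empty()) std::memcpy(v.payload, s.data(), s.size());
    return v;
  }
  std::string& buf = buffers[static_cast<std::size_t>(buffer_index)];
  const std::int32_t offset = static_cast<std::int32_t>(buf.size());
  buf.append(s);
  std::memcpy(v.payload, s.data(), 4);            // inlined prefix
  std::memcpy(v.payload + 4, &buffer_index, 4);   // which variadic buffer
  std::memcpy(v.payload + 8, &offset, 4);         // where in that buffer
  return v;
}

// Resolves a view against the buffers. For inlined strings the result
// points into `v` itself, so `v` must outlive the returned string_view.
std::string_view Resolve(const OffsetView& v,
                         const std::vector<std::string>& buffers) {
  const std::size_t n = static_cast<std::size_t>(v.length);
  if (v.length <= kInlineLimit) return std::string_view(v.payload, n);
  std::int32_t buffer_index = 0, offset = 0;
  std::memcpy(&buffer_index, v.payload + 4, 4);
  std::memcpy(&offset, v.payload + 8, 4);
  const std::string& buf = buffers[static_cast<std::size_t>(buffer_index)];
  return std::string_view(buf.data() + offset, n);
}

// Converts to the raw-pointer form. The result is only valid while
// `buffers` stays alive at the same addresses in this process.
RawPointerView ToRawPointer(const OffsetView& v,
                            const std::vector<std::string>& buffers) {
  RawPointerView out{};
  out.length = v.length;
  if (v.length <= kInlineLimit) {
    std::memcpy(out.payload, v.payload, sizeof(out.payload));
    return out;
  }
  const char* p = Resolve(v, buffers).data();
  std::memcpy(out.payload, v.payload, 4);         // same prefix
  std::memcpy(out.payload + 4, &p, sizeof(p));    // absolute address
  return out;
}

// Resolves a raw-pointer view; no buffer table is needed, but the embedded
// pointer must be trusted and dereferenced as-is.
std::string_view ResolveRaw(const RawPointerView& v) {
  const std::size_t n = static_cast<std::size_t>(v.length);
  if (v.length <= kInlineLimit) return std::string_view(v.payload, n);
  const char* p = nullptr;
  std::memcpy(&p, v.payload + 4, sizeof(p));
  return std::string_view(p, n);
}
```

In this sketch an OffsetView still resolves correctly after the buffers are
copied to a different address (or, by extension, sent over IPC), whereas a
RawPointerView would have to be rewritten. That is the trade-off behind the
validation and IPC-divergence concerns listed earlier in the thread.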
