Agreed, it's unfortunately not just a simple tradeoff. We have discussed
this a bit in [1] and in several other threads around this topic. If we say
that Arrow is about interchange and not execution, and therefore shouldn't
adopt the pointer version that DuckDB uses, then we're also making
interchange harder: systems like DuckDB must convert from their internal
format to the Arrow format at the boundary. Adding the pointer version to
the Arrow format solves that, but creates costs elsewhere.

IIUC Ben's proposal tried to solve this tension by making it possible for
two systems to agree to use the pointer version and pass data without
serialization costs. That comes with its own risks and tradeoffs.

This feels like another case where the "canonical alternative layout"
discussed in [1] could be a way to formalize this variation, allowing it to
be used without requiring it of all implementations. One way or another, we
need to find a way to balance the desire for Arrow to be a universal
standard against the risk of diluting the standard to accommodate every
project.

Neal

[1]: https://lists.apache.org/thread/djl9dbd7qmozxtjpfzby40gg23x0o3wo

On Fri, Oct 6, 2023 at 11:47 AM Weston Pace <weston.p...@gmail.com> wrote:

> > I feel the broader question here is what is Arrow's intended use case -
> > interchange or execution
>
> The line between interchange and execution is not always clear.  For
> example, I think we would like Arrow to be considered as a standard for UDF
> libraries.
>
> On Fri, Oct 6, 2023 at 7:34 AM Mark Raasveldt <m...@duckdblabs.com> wrote:
>
> > For the index vs pointer question - DuckDB went with pointers as they
> > are more flexible, and DuckDB was designed to consume data (and
> > strings) from a wide variety of formats in a wide variety of languages.
> > Pointers allow us to easily zero-copy from e.g. Python strings, R
> > strings, Arrow strings, etc. The flip side of pointers is that
> > ownership has to be handled very carefully. Our vector format is an
> > execution-only format, and never leaves the internals of the engine.
> > This greatly simplifies ownership as we are in complete control of what
> > happens inside the engine. For an interchange format that is intended
> > for handing data between engines, I can see this being more complicated
> > and verification being more important.
> >
> > As for the actual change:
> >
> > From an interchange perspective from DuckDB's side - the proposed
> > zero-copy integration would definitely speed up the conversion of
> > DuckDB string vectors to Arrow string vectors. In a recent benchmark we
> > found string conversion to Arrow vectors to be a bottleneck in certain
> > workloads, although we have not sufficiently researched whether this
> > could be improved in other ways. It is possible this can be alleviated
> > without requiring changes to Arrow.
> >
> > However - in general, a new string vector format is only useful if
> > consumers also support the format. If the consumer immediately converts
> > the strings back into the standard Arrow string representation then
> > there is no benefit. The change will only move where the conversion
> > happens (from inside DuckDB to inside the consumer). As such, this
> > change is only useful if the broader Arrow ecosystem moves towards
> > supporting the new string format.
> >
> > From an execution perspective from DuckDB's side - it is unlikely that we
> > will switch to using Arrow as an internal format at this stage of the
> > project. While this change increases Arrow's utility as an intermediate
> > execution format, that is more relevant to projects that are currently
> > using Arrow in this manner or are planning to use Arrow in this manner.
> >
> > I feel the broader question here is what is Arrow's intended use case -
> > interchange or execution - as they are opposed in this discussion. This
> > change improves Arrow's utility as an execution format at the expense
> > of stability in the interchange format. From my perspective Arrow is
> > more useful as an interchange format. When different tools communicate
> > with each other a standard is required. An execution format is
> > generally not exposed outside of the internals of the execution engine.
> > Engines can do whatever they want here - and a standard is perhaps not
> > as useful.
> >
> > On 2023/10/02 13:21:59 Andrew Lamb wrote:
> > > > I don't think "we have to adjust the Arrow format so that existing
> > > > internal representations become Arrow-compliant without any
> > > > (re-)implementation effort" is a reasonable design principle.
> > >
> > > I agree with this statement from Antoine -- given the Arrow community
> > > has standardized an addition to the format with StringView, I think
> > > it would help to get some input from those at DuckDB and Velox on
> > > their perspective
> > >
> > > Andrew
> > >
> > > On Mon, Oct 2, 2023 at 9:17 AM Raphael Taylor-Davies
> > > <r....@googlemail.com.invalid> wrote:
> > >
> > > > Oh I'm with you on it being a precedent we want to be very careful
> > > > about setting, but if there isn't a meaningful performance
> > > > difference, we may be able to sidestep that discussion entirely.
> > > >
> > > > On 02/10/2023 14:11, Antoine Pitrou wrote:
> > > > >
> > > > > Even if performance were significantly better, I don't think
> > > > > it's a good enough reason to add these representations to Arrow.
> > > > > By construction, a standard cannot continuously chase the
> > > > > performance state of the art; it has to weigh the benefits of
> > > > > performance improvements against the increased cost for the
> > > > > ecosystem (for example the cost of adapting to frequent standard
> > > > > changes and a growing standard size).
> > > > >
> > > > > We have extension types which could reasonably be used for
> > > > > non-standard data types, especially the kind that are motivated
> > > > > by leading-edge performance research and innovation and come with
> > > > > unusual constraints (such as requiring trusting and dereferencing
> > > > > raw pointers embedded in data buffers). There could even be an
> > > > > argument for making some of them canonical extension types if
> > > > > there's enough precedent in their favor.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > On 02/10/2023 at 15:00, Raphael Taylor-Davies wrote:
> > > > >> I think what would really help would be some concrete numbers.
> > > > >> Do we have any numbers comparing the performance of the offset-
> > > > >> and pointer-based representations? If there isn't a significant
> > > > >> performance difference between them, would the systems that
> > > > >> currently use a pointer-based approach be willing to meet us in
> > > > >> the middle and switch to an offset-based encoding? This to me
> > > > >> feels like it would be the best outcome for the ecosystem as a
> > > > >> whole.
> > > > >>
> > > > >> Kind Regards,
> > > > >>
> > > > >> Raphael
> > > > >>
> > > > >> On 02/10/2023 13:50, Antoine Pitrou wrote:
> > > > >>>
> > > > >>> On 01/10/2023 at 16:21, Micah Kornfield wrote:
> > > > >>>>>
> > > > >>>>> I would also assert that another way to reduce this risk is
> > > > >>>>> to add some prose to the relevant sections of the columnar
> > > > >>>>> format specification doc to clearly explain that a raw
> > > > >>>>> pointers variant of the layout, while not part of the
> > > > >>>>> official spec, may be implemented in some Arrow libraries.
> > > > >>>>
> > > > >>>> I've lost a little context on all the concerns of adding raw
> > > > >>>> pointers as an official option to the spec, but I see making
> > > > >>>> raw-pointer variants the best path forward.
> > > > >>>>
> > > > >>>> Things captured from this thread, or that seem obvious at
> > > > >>>> least to me:
> > > > >>>> 1. Divergence of the IPC spec from the in-memory/C-ABI spec.
> > > > >>>> 2. More parts of the spec to cover.
> > > > >>>> 3. Incompatibility with some languages.
> > > > >>>> 4. Validation (in my mind different use-cases require
> > > > >>>> different levels of validation, so this is a little bit less
> > > > >>>> of a concern).
> > > > >>>>
> > > > >>>> I think the broader issue is how we think about compatibility
> > > > >>>> with other systems. For instance, what happens if Velox and
> > > > >>>> DuckDB start adding new divergent memory layouts? Are we
> > > > >>>> expecting to add them to the spec?
> > > > >>>
> > > > >>> This is a slippery slope. The more Arrow has a policy of
> > > > >>> integrating existing practices simply because they exist, the
> > > > >>> more the Arrow format will become _à la carte_, with different
> > > > >>> implementations choosing to implement whatever they want to
> > > > >>> spend their engineering effort on (you can see this occur, in
> > > > >>> part, on the Parquet format with its many different encodings,
> > > > >>> compression algorithms and a 96-bit timestamp type).
> > > > >>>
> > > > >>> We _have_ to think carefully about the middle- and long-term
> > > > >>> future of the format when adopting new features.
> > > > >>>
> > > > >>> In this instance, we are doing a large part of the effort by
> > > > >>> adopting a string view format with variadic buffers, inlined
> > > > >>> prefixes and offset-based views into those buffers. But some
> > > > >>> implementations with historically different internal
> > > > >>> representations will have to share part of the effort to align
> > > > >>> with the newly standardized format.
> > > > >>>
> > > > >>> I don't think "we have to adjust the Arrow format so that
> > > > >>> existing internal representations become Arrow-compliant
> > > > >>> without any (re-)implementation effort" is a reasonable design
> > > > >>> principle.
> > > > >>>
> > > > >>> Regards
> > > > >>>
> > > > >>> Antoine.
> > > >
> > >
>
