Re: [DISCUSS][C++] Raw pointer string views

Micah Kornfield Sun, 01 Oct 2023 07:23:22 -0700

>
> I would also assert that another way to reduce this risk is to add
> some prose to the relevant sections of the columnar format
> specification doc to clearly explain that a raw pointers variant of
> the layout, while not part of the official spec, may be implemented in
> some Arrow libraries.


I've lost a little context on all the concerns about adding raw pointers
as an official option to the spec, but I see making raw-pointer variants
as the best path forward.

Concerns captured from this thread, or that seem obvious at least to me:
1.  Divergence of IPC spec from in-memory/C-ABI spec?
2.  More parts of the spec to cover.
3.  Incompatibility with some languages
4.  Validation (different use-cases require different levels of
validation, so this is a little bit less of a concern in my mind).
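
To make item 4 concrete, here is a minimal C++ sketch of the long-string
case of the two view layouts discussed in this thread. The struct and
function names (IndexOffsetView, RawPointerView, IsValid, Swizzle) are
hypothetical and are not the Arrow C++ API; the field order is assumed to
follow the 16-byte view described in Columnar.rst, and short inlined
strings are not modeled. It is only meant to illustrate why the
index/offset form can be bounds-checked against its buffers, and why
converting from index/offset to a raw pointer is comparatively cheap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical illustration types; NOT the Arrow C++ API.
// Spec-style view: refers to its bytes by (buffer index, offset).
struct IndexOffsetView {
  std::int32_t length;
  char prefix[4];             // first bytes of the string
  std::int32_t buffer_index;  // which data buffer holds the bytes
  std::int32_t offset;        // where the bytes start in that buffer
};

// Raw pointer variant: same length and prefix, but a direct address.
struct RawPointerView {
  std::int32_t length;
  char prefix[4];
  const char* data;
};

// An index/offset view can be checked against the buffers it names.
inline bool IsValid(const IndexOffsetView& v,
                    const std::vector<std::string>& buffers) {
  if (v.length < 0 || v.buffer_index < 0 || v.offset < 0) return false;
  if (static_cast<std::size_t>(v.buffer_index) >= buffers.size()) return false;
  const std::string& buf = buffers[v.buffer_index];
  const std::size_t end = static_cast<std::size_t>(v.offset) +
                          static_cast<std::size_t>(v.length);
  if (end > buf.size()) return false;
  // The stored prefix must match the referenced bytes.
  const std::size_t n =
      std::min<std::size_t>(4, static_cast<std::size_t>(v.length));
  return std::memcmp(v.prefix, buf.data() + v.offset, n) == 0;
}

// "Pointer swizzle": resolve (buffer index, offset) to an address.
// No string bytes are copied. The result has no buffer to be checked
// against, so nothing comparable to IsValid can be written for it.
inline RawPointerView Swizzle(const IndexOffsetView& v,
                              const std::vector<std::string>& buffers) {
  assert(IsValid(v, buffers));  // caller is expected to validate first
  RawPointerView out;
  out.length = v.length;
  std::memcpy(out.prefix, v.prefix, sizeof(out.prefix));
  out.data = buffers[v.buffer_index].data() + v.offset;
  return out;
}
```

Going the other way (raw pointer to index/offset) would need to find or
copy the bytes into a known buffer, which is the serialization cost
mentioned elsewhere in this thread.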

I think the broader issue is how we think about compatibility with other
systems.  For instance, what happens if Velox and DuckDB start adding new
divergent memory layouts?  Are we expecting to add them to the spec?

Thanks,
Micah

On Fri, Sep 29, 2023 at 11:11 AM Ian Cook <ianmc...@apache.org> wrote:

> I strongly agree with Ben's assertion that "the risk of a parallel
> ecosystem… is more likely to be provoked by excluding a user's vital
> use case [than by implementing support for an unofficial layout
> variant]" in the C++ library. But there seems to be a consensus here
> that there is a real risk of sowing confusion. Thank you Ben for your
> readiness to consider the suggested approaches for reducing this risk.
>
> I would also assert that another way to reduce this risk is to add
> some prose to the relevant sections of the columnar format
> specification doc to clearly explain that a raw pointers variant of
> the layout, while not part of the official spec, may be implemented in
> some Arrow libraries.
>
> Ian
>
> On Thu, Sep 28, 2023 at 2:14 PM Felipe Oliveira Carvalho
> <felipe...@gmail.com> wrote:
> >
> > My take here is that Ben did an excellent job in hiding the fact that C++
> > has two variations of the format without leaking the pointer version via
> > the interfaces through which Arrow arrays are communicated to other
> > implementations.
> >
> > As things stand right now, there is no zero-copy transfer of pointer-based
> > string views. Ben can give the final authoritative answer on this. The idea
> > of zero-copy transfers was discussed but decided against to avoid adding a
> > format to the spec that can't be implemented by languages that can't cast
> > arbitrary memory bytes to objects (the case for many languages that are not
> > C or C++).
> >
> > Having established that the spec is not "polluted" by a format that only
> > systems-languages can implement, we can look at the constraint of keeping
> > implementations completely faithful to the spec:
> >
> > Pros:
> >  - The reference implementations serve as an alternative to the spec text
> > in being a one-to-one translation of the spec
> >
> > Cons:
> > - Performance loss (it's hard to predict how many optimizations can be lost
> > by forcing an extra memory indirection when looping)
> > - Insensibility to the ergonomics afforded by the language
> >
> > Variations are bound to happen any time a language doesn't afford good
> > usability without conversions every time the data is used. In JavaScript,
> > for instance, the use of UTF-16 is much more widespread than the use of
> > UTF-8. It would make sense for a JavaScript implementation to keep string
> > arrays in UTF-16 at rest.
> >
> > Sometimes software specs are accompanied by two types of implementations:
> > the reference implementation that tries to be simple and didactic; and
> > implementations used in practice because they are allowed to deviate
> > internally, doing things in a more complicated way than the spec requires,
> > to achieve some practical advantage. Are all the implementations in the
> > apache/arrow repository of the first kind?
> >
> > --
> > Felipe
> >
> > On Thu, Sep 28, 2023 at 1:10 PM Andrew Lamb <al...@influxdata.com> wrote:
> >
> > > > What this PR is creating is an "unofficial" Arrow format, with data
> > > > types exposed in Arrow C++ that are not part of the Arrow standard, but
> > > > are exposed as if they were.
> > >
> > > I agree with Antoine here. It seems a pretty clear-cut story: the C++
> > > implementation doesn't follow the spec, and thus we should either
> > > 1.  Update the standard to allow raw pointers
> > > 2.  Fix the C++ implementation to not have them / treat them as though
> > > they were
> > >
> > > If the core usecase is "arrow has the same in memory format used by DuckDB
> > > and Velox, and those systems can't/won't change their implementations", it
> > > seems like the only path forward for that usecase is to adopt their model
> > > (raw pointers) directly. Maybe I am missing something.
> > >
> > >
> > > Andrew
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Sep 28, 2023 at 11:11 AM Raphael Taylor-Davies
> > > <r.taylordav...@googlemail.com.invalid> wrote:
> > >
> > > > FWIW Rust wouldn't have issues using raw pointers, I can't speak for
> > > > other languages though. They would be more expensive to validate, but
> > > > validation is not going to be cheap regardless.
> > > >
> > > > I could definitely see a world where view types use pointers and IPC
> > > > coerces to/from the large non-view types. IPC has to copy the string data
> > > > regardless and re-encoding would avoid encoding masked data.
> > > >
> > > > The notion of supporting both is less of an exciting prospect... I'm also
> > > > not sure if it is too late to make changes at this stage.
> > > >
> > > > On 28 September 2023 15:26:57 BST, Wes McKinney <wesmck...@gmail.com>
> > > > wrote:
> > > > >hi all,
> > > > >
> > > > > >I'm just catching up on this thread after having taken a look at the
> > > > > >format PRs, the C++ implementation PR, and this e-mail thread. So only
> > > > > >my $0.02 from having spent a great deal less time on this project than
> > > > > >others.
> > > > >
> > > > > >The original motivation I had for bringing up the idea of adding the
> > > > > >StringView concept from DuckDB / Velox / UmbraDB to the Arrow in-memory
> > > > > >format (though not necessarily the IPC format) was to provide a path for
> > > > > >zero-copy interoperability in some cases with these systems when dealing
> > > > > >with strings, and to enhance performance within Arrow applications
> > > > > >(setting aside the external interop goal) in scenarios where being able
> > > > > >to point to external memory spaces could avoid a copy-and-repack step. I
> > > > > >think it's useful to have a zero-copy IPC-compatible string format (i.e.
> > > > > >what was proposed and merged into Columnar.rst) that allows for
> > > > > >out-of-order construction of arrays, reuse of memory (e.g. consider the
> > > > > >case of decoding dictionary-encoded Parquet data: not having to copy
> > > > > >strings many times when rehydrating string arrays), and chunked
> > > > > >allocation; all good things that the existing Arrow VarBinary layout
> > > > > >does not provide for.
> > > > >
> > > > > >For the in-memory side of things, I am somewhat more of Antoine's
> > > > > >perspective that trying to have both in-memory layouts (index+offset and
> > > > > >raw pointers) creates a kind of uncanny valley situation that may
> > > > > >confuse users and cause other problems (especially if the raw pointer
> > > > > >version is only found in the C++ library). The raw pointer version also
> > > > > >cannot be validated, but I see validation as less of a requirement and
> > > > > >more of a "nice to have" (I realize others see validation as more of a
> > > > > >requirement).
> > > > >
> > > > > >* I see the raw-pointer type as having more net utility (going back to
> > > > > >the original motivation), but I also see how it is problematic for some
> > > > > >non-C++ implementations.
> > > > > >* The index-offset version has intrinsic value over the existing "dense"
> > > > > >varbinary layout (per some of the benefits above) but does not satisfy
> > > > > >the external interoperability goal with systems that are becoming more
> > > > > >popular month over month.
> > > > > >* Incoming data from external systems that use the raw pointer model
> > > > > >have to be serialized (and perhaps repacked) to the index-offset model.
> > > > > >This isn't ideal; going the other way (from index-offset to raw pointer)
> > > > > >is just a pointer swizzle, comparatively inexpensive.
> > > > >
> > > > > >So it seems like we have several paths available, none of them wholly
> > > > > >satisfactory:
> > > > > >
> > > > > >1. Essentially what's in the existing PR: the raw pointer variant, which
> > > > > >is "non-standard".
> > > > > >2. Pick one and only one for in-memory use. I think the raw pointer
> > > > > >version is more useful given that swizzling from index-offset is pretty
> > > > > >cheap. But the raw pointer version can't be validated safely and is
> > > > > >problematic for e.g. Rust. Picking the index-offset version means that
> > > > > >the external ecosystem of columnar engines won't be that much closer
> > > > > >aligned to Arrow than they are now.
> > > > > >3. Implement the raw pointer variant as an extension type in C++ / C
> > > > > >ABI. This seems potentially useful but given that it would likely be
> > > > > >disfavored for data originating from Arrow-land, there would be fewer
> > > > > >scenarios where zero-copy interop for strings is achieved.
> > > > > >
> > > > > >This is difficult and I don't know what the best answer is, but
> > > > > >personally my inclination has been toward choices that are utilitarian
> > > > > >and help with alignment and cohesion in the open source ecosystem.
> > > > >
> > > > >- Wes
> > > > >
> > > > > >On Thu, Sep 28, 2023 at 5:20 AM Antoine Pitrou <anto...@python.org>
> > > > > >wrote:
> > > > >
> > > > >>
> > > > > >> To make things clear, any of the factory functions listed below
> > > > > >> create a type that maps exactly onto an Arrow columnar layout:
> > > > > >> https://arrow.apache.org/docs/dev/cpp/api/datatype.html#factory-functions
> > > > > >>
> > > > > >> For example, calling `arrow::dictionary` creates a dictionary type
> > > > > >> that exactly represents the dictionary layout specified in
> > > > > >> https://arrow.apache.org/docs/dev/format/Columnar.html#dictionary-encoded-layout
> > > > > >>
> > > > > >> Similarly, if you use any of the builders listed below, what you will
> > > > > >> get at the end is data that complies with the Arrow columnar
> > > > > >> specification:
> > > > > >> https://arrow.apache.org/docs/dev/cpp/api/builder.html
> > > > > >>
> > > > > >> All the core Arrow C++ APIs create and process data which complies
> > > > > >> with the Arrow specification, and which is interoperable with other
> > > > > >> Arrow implementations.
> > > > > >>
> > > > > >> Conversely, non-Arrow data such as CSV or Parquet (or Python lists,
> > > > > >> etc.) goes through dedicated converters. There is no ambiguity.
> > > > >>
> > > > >>
> > > > > >> Creating top-level utilities that create non-Arrow data introduces
> > > > > >> confusion and ambiguity as to what Arrow is. Users who haven't studied
> > > > > >> the spec in detail - which is probably most users of Arrow
> > > > > >> implementations - will call `arrow::string_view(raw_pointers=true)`
> > > > > >> and might later discover that their data cannot be shared with other
> > > > > >> implementations (or, if it can, there will be an unsuspected
> > > > > >> conversion cost at the edge).
> > > > > >>
> > > > > >> It also creates a risk of introducing a parallel Arrow-like ecosystem
> > > > > >> based on the superset of data layouts understood by Arrow C++. People
> > > > > >> may feel encouraged to code for that ecosystem, pessimizing
> > > > > >> interoperability with non-C++ runtimes.
> > > > > >>
> > > > > >> Which is why I think those APIs, however convenient, also go against
> > > > > >> the overarching goals of the Arrow project.
> > > > > >>
> > > > > >>
> > > > > >> If we want to keep such convenience APIs as part of Arrow C++, they
> > > > > >> should be clearly flagged as being non-Arrow compliant.
> > > > > >>
> > > > > >> It could be by naming (e.g. `arrow::non_arrow_string_view()`) or by
> > > > > >> specific namespacing (e.g. `non_arrow::raw_pointers_string_view()`).
> > > > > >>
> > > > > >> But they could also be provided by a distinct library.
> > > > >>
> > > > >> Regards
> > > > >>
> > > > >> Antoine.
> > > > >>
> > > > >>
> > > > >>
> > > > > >> On 28/09/2023 at 09:01, Antoine Pitrou wrote:
> > > > >> >
> > > > >> > Hi Ben,
> > > > >> >
> > > > > >> > On 27/09/2023 at 23:25, Benjamin Kietzman wrote:
> > > > >> >>
> > > > >> >> @Antoine
> > > > > >> >>> What this PR is creating is an "unofficial" Arrow format, with data
> > > > > >> >>> types exposed in Arrow C++ that are not part of the Arrow standard,
> > > > > >> >>> but are exposed as if they were.
> > > > > >> >>
> > > > > >> >> We already do this in every implementation of the arrow format I'm
> > > > > >> >> aware of: it's more convenient to consider dictionary as a data type
> > > > > >> >> even though the spec says that it is a field property.
> > > > >> >
> > > > > >> > I'm not sure I understand your point. Dictionary encoding is part of
> > > > > >> > the Arrow spec, and considering it as a data type is an API choice
> > > > > >> > that does not violate the spec.
> > > > >> >
> > > > >> > Raw pointers in string views is just not an Arrow format.
> > > > >> >
> > > > >> > Regards
> > > > >> >
> > > > >> > Antoine.
> > > > >>
> > > >
> > >
>
