hi all,

I'm just catching up on this thread after having taken a look at the format
PRs, the C++ implementation PR, and this e-mail thread. So only my $0.02
from having spent a great deal less time on this project than others.

The original motivation I had for bringing up the idea of adding the
StringView concept from DuckDB / Velox / UmbraDB to the Arrow in-memory
format (though not necessarily the IPC format) was to provide a path for
zero-copy interoperability in some cases with these systems when dealing
with strings, and to enhance performance within Arrow-applications (setting
aside the external interop goal) in scenarios where being able to point to
external memory spaces could avoid a copy-and-repack step. I think it's
useful to have a zero-copy IPC-compatible string format (i.e. what was
proposed and merged into Columnar.rst) that allows for out-of-order
construction of arrays, reuse of memory (e.g. when decoding
dictionary-encoded Parquet data, not having to copy strings many times
when rehydrating string arrays), and chunked allocation: all good things
that the existing Arrow VarBinary layout does not provide for.
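
To make the memory-reuse point concrete, here is a rough sketch of a
16-byte index+offset view and of dictionary decoding that copies views
rather than string bytes. This is not the Arrow C++ API: `View`, `Append`,
`Resolve` and `DecodeDictionary` are hypothetical names, and using
`std::string` as the data buffer is only an assumption for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

constexpr int32_t kInlineSize = 12;

// Hypothetical 16-byte view. Short strings live entirely in `payload`;
// long ones store prefix[4] + buffer_index (int32) + offset (int32).
struct View {
  int32_t length;
  char payload[12];
};
static_assert(sizeof(View) == 16, "view is expected to be 16 bytes");

// Build a view for `s`; long strings are appended to the last data buffer.
View Append(std::string_view s, std::vector<std::string>& buffers) {
  View v{};
  v.length = static_cast<int32_t>(s.size());
  if (v.length <= kInlineSize) {
    if (!s.empty()) std::memcpy(v.payload, s.data(), s.size());
    return v;
  }
  if (buffers.empty()) buffers.emplace_back();
  const int32_t index = static_cast<int32_t>(buffers.size() - 1);
  const int32_t offset = static_cast<int32_t>(buffers.back().size());
  buffers.back().append(s);
  std::memcpy(v.payload, s.data(), 4);     // 4-byte prefix
  std::memcpy(v.payload + 4, &index, 4);   // which data buffer
  std::memcpy(v.payload + 8, &offset, 4);  // where in that buffer
  return v;
}

// Read a view back. The result borrows from `v` or from `buffers`.
std::string_view Resolve(const View& v,
                         const std::vector<std::string>& buffers) {
  if (v.length <= kInlineSize) {
    return std::string_view(v.payload, static_cast<std::size_t>(v.length));
  }
  int32_t index = 0;
  int32_t offset = 0;
  std::memcpy(&index, v.payload + 4, 4);
  std::memcpy(&offset, v.payload + 8, 4);
  return std::string_view(buffers[index]).substr(offset, v.length);
}

// Decode dictionary indices by copying 16-byte views; the string bytes
// stay where they are, however often a dictionary value repeats.
std::vector<View> DecodeDictionary(const std::vector<View>& dict_views,
                                   const std::vector<int32_t>& indices) {
  std::vector<View> out;
  out.reserve(indices.size());
  for (int32_t i : indices) out.push_back(dict_views[i]);
  return out;
}
```

With the dense VarBinary layout, each repeated dictionary value would have
its bytes copied into the output data buffer again; in this sketch the
output only grows by 16 bytes per element.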

For the in-memory side of things, I lean somewhat more toward Antoine's
perspective that trying to have both in-memory variants (index+offset and
raw pointers) creates a kind of uncanny valley situation that may confuse users
and cause other problems (especially if the raw pointer version is only
found in the C++ library). The raw pointer version also cannot be
validated, but I see validation as less of a requirement and more of a
"nice to have" (I realize others see validation as more of a requirement).

* I see the raw-pointer type as having more net utility (going back to the
original motivation), but I also see how it is problematic for some non-C++
implementations.
* The index-offset version has intrinsic value over the existing "dense"
varbinary layout (per some of the benefits above) but does not satisfy the
external interoperability goal with systems that are becoming more popular
month over month.
* Incoming data from external systems that use the raw pointer model have
to be serialized (and perhaps repacked) to the index-offset model. This
isn't ideal — going the other way (from index-offset to raw pointer) is
just a pointer swizzle, comparatively inexpensive.
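
A rough sketch of why the two directions are asymmetric, covering only
out-of-line strings (inlined short strings need no conversion). The struct
and function names here are hypothetical and are not taken from the PR or
from Arrow C++.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical out-of-line views for strings longer than 12 bytes.
struct IndexOffsetView {
  int32_t length;
  char prefix[4];
  int32_t buffer_index;
  int32_t offset;
};

struct RawPointerView {
  int32_t length;
  char prefix[4];
  const char* data;
};

// index-offset -> raw pointer: one table lookup and one add per element.
// No string bytes are touched.
RawPointerView Swizzle(const IndexOffsetView& v,
                       const std::vector<const char*>& buffer_starts) {
  RawPointerView out;
  out.length = v.length;
  std::memcpy(out.prefix, v.prefix, 4);
  out.data = buffer_starts[v.buffer_index] + v.offset;
  return out;
}

// raw pointer -> index-offset: the pointer may refer to memory the array
// knows nothing about, so this sketch copies the bytes into an owned buffer.
IndexOffsetView Repack(const RawPointerView& v, std::string& owned_buffer,
                       int32_t buffer_index) {
  IndexOffsetView out;
  out.length = v.length;
  std::memcpy(out.prefix, v.prefix, 4);
  out.buffer_index = buffer_index;
  out.offset = static_cast<int32_t>(owned_buffer.size());
  owned_buffer.append(v.data, static_cast<std::size_t>(v.length));
  return out;
}
```

`Swizzle` does constant work per element regardless of string length,
while `Repack` costs time proportional to the total number of string bytes.
A producer that already knows which buffers its pointers fall into could
avoid the copy, but that is an assumption the receiving side cannot make in
general.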

So it seems like we have several paths available, none of them wholly
satisfactory:

1. Essentially what's in the existing PR — the raw pointer variant which is
"non-standard"
2. Pick one and only one for in memory — I think the raw pointer version is
more useful given that swizzling from index-offset is pretty cheap. But the
raw pointer version can't be validated safely and is problematic for e.g.
Rust. Picking the index-offset version means that the external ecosystem of
columnar engines won't be that much closer aligned to Arrow than they are
now.
3. Implement the raw pointer variant as an extension type in C++ / C ABI.
This seems potentially useful but given that it would likely be disfavored
for data originating from Arrow-land, there would be fewer scenarios where
zero-copy interop for strings is achieved.

This is difficult and I don't know what the best answer is, but personally
my inclination has been toward choices that are utilitarian and help with
alignment and cohesion in the open source ecosystem.

- Wes

On Thu, Sep 28, 2023 at 5:20 AM Antoine Pitrou <anto...@python.org> wrote:

>
> To make things clear, any of the factory functions listed below create a
> type that maps exactly onto an Arrow columnar layout:
> https://arrow.apache.org/docs/dev/cpp/api/datatype.html#factory-functions
>
> For example, calling `arrow::dictionary` creates a dictionary type that
> exactly represents the dictionary layout specified in
>
> https://arrow.apache.org/docs/dev/format/Columnar.html#dictionary-encoded-layout
>
> Similarly, if you use any of the builders listed below, what you will
> get at the end is data that complies with the Arrow columnar specification:
> https://arrow.apache.org/docs/dev/cpp/api/builder.html
>
> All the core Arrow C++ APIs create and process data which complies with
> the Arrow specification, and which is interoperable with other Arrow
> implementations.
>
> Conversely, non-Arrow data such as CSV or Parquet (or Python lists,
> etc.) goes through dedicated converters. There is no ambiguity.
>
>
> Creating top-level utilities that create non-Arrow data introduces
> confusion and ambiguity as to what Arrow is. Users who haven't studied
> the spec in detail - which is probably most users of Arrow
> implementations - will call `arrow::string_view(raw_pointers=true)` and
> might later discover that their data cannot be shared with other
> implementations (or, if it can, there will be an unsuspected conversion
> cost at the edge).
>
> It also creates a risk of introducing a parallel Arrow-like ecosystem
> based on the superset of data layouts understood by Arrow C++. People
> may feel encouraged to code for that ecosystem, pessimizing
> interoperability with non-C++ runtimes.
>
> Which is why I think those APIs, however convenient, also go against the
> overarching goals of the Arrow project.
>
>
> If we want to keep such convenience APIs as part of Arrow C++, they
> should be clearly flagged as being non-Arrow compliant.
>
> It could be by naming (e.g. `arrow::non_arrow_string_view()`) or by
> specific namespacing (e.g. `non_arrow::raw_pointers_string_view()`).
>
> But they could also be provided by a distinct library.
>
> Regards
>
> Antoine.
>
>
>
> On 28/09/2023 at 09:01, Antoine Pitrou wrote:
> >
> > Hi Ben,
> >
> > On 27/09/2023 at 23:25, Benjamin Kietzman wrote:
> >>
> >> @Antoine
> >>> What this PR is creating is an "unofficial" Arrow format, with data
> >>> types exposed in Arrow C++ that are not part of the Arrow standard, but
> >>> are exposed as if they were.
> >>
> >> We already do this in every implementation of the arrow format I'm
> >> aware of: it's more convenient to consider dictionary as a data type
> >> even though the spec says that it is a field property.
> >
> > I'm not sure I understand your point. Dictionary encoding is part of the
> > Arrow spec, and considering it as a data type is an API choice that does
> > not violate the spec.
> >
> > Raw pointers in string views is just not an Arrow format.
> >
> > Regards
> >
> > Antoine.
>
