hi all, I'm just catching up on this thread after having taken a look at the format PRs, the C++ implementation PR, and this e-mail thread. So only my $0.02 from having spent a great deal less time on this project than others.
The original motivation I had for bringing up the idea of adding the StringView concept from DuckDB / Velox / UmbraDB to the Arrow in-memory format (though not necessarily the IPC format) was to provide a path for zero-copy interoperability in some cases with these systems when dealing with strings, and to enhance performance within Arrow-applications (setting aside the external interop goal) in scenarios where being able to point to external memory spaces could avoid a copy-and-repack step. I think it's useful to have an zero-copy IPC-compatible string format (i.e. what was proposed and merged into Columnar.rst) for that allows for out-of-order construction or arrays, reuse of memory (e.g. consider the case of decoding dictionary encoding Parquet data — not having to copy strings many times when rehydrating string arrays), and chunked allocation — all good things that the existing Arrow VarBinary layout does not provide for. For the in-memory side of things, I am somewhat more of Antoine's perspective that trying to have both in-memory (index+offset and raw pointers) creates a kind of uncanny valley situation that may confuse users and cause other problems (especially if the raw pointer version is only found in the C++ library). The raw pointer version also cannot be validated, but I see validation as less of a requirement and more of a "nice to have" (I realize others see validation as more of a requirement). * I see the raw-pointer type has having more net utility (going back to the original motivation), but I also see how it is problematic for some non-C++ implementations. * The index-offset version is intrinsic value over the existing "dense" varbinary layout (per some of the benefits above) but does not satisfy the external interoperability goal with systems that are becoming more popular month over month * Incoming data from external systems that use the raw pointer model have to be serialized (and perhaps repacked) to the index-offset model. This isn't ideal — going the other way (from index-offset to raw pointer) is just a pointer swizzle, comparatively inexpensive. So it seems like we have several paths available, none of them wholly satisfactory: 1. Essentially what's in the existing PR — the raw pointer variant which is "non-standard" 2. Pick one and only one for in memory — I think the raw pointer version is more useful given that swizzling from index-offset is pretty cheap. But the raw pointer version can't be validated safely and is problematic for e.g. Rust. Picking the index-offset version means that the external ecosystem of columnar engines won't be that much closer aligned to Arrow than they are now. 3. Implement the raw pointer variant as an extension type in C++ / C ABI. This seems potentially useful but given that it would likely be disfavored for data originating from Arrow-land, there would be fewer scenarios where zero-copy interop for strings is achieved This is difficult and I don't know what the best answer is, but personally my inclination has been toward choices that are utilitarian and help with alignment and cohesion in the open source ecosystem. - Wes On Thu, Sep 28, 2023 at 5:20 AM Antoine Pitrou <anto...@python.org> wrote: > > To make things clear, any of the factory functions listed below create a > type that maps exactly onto an Arrow columnar layout: > https://arrow.apache.org/docs/dev/cpp/api/datatype.html#factory-functions > > For example, calling `arrow::dictionary` creates a dictionary type that > exactly represents the dictionary layout specified in > > https://arrow.apache.org/docs/dev/format/Columnar.html#dictionary-encoded-layout > > Similarly, if you use any of the builders listed below, what you will > get at the end is data that complies with the Arrow columnar specification: > https://arrow.apache.org/docs/dev/cpp/api/builder.html > > All the core Arrow C++ APIs create and process data which complies with > the Arrow specification, and which is interoperable with other Arrow > implementations. > > Conversely, non-Arrow data such as CSV or Parquet (or Python lists, > etc.) goes through dedicated converters. There is no ambiguity. > > > Creating top-level utilities that create non-Arrow data introduces > confusion and ambiguity as to what Arrow is. Users who haven't studied > the spec in detail - which is probably most users of Arrow > implementations - will call `arrow::string_view(raw_pointers=true)` and > might later discover that their data cannot be shared with other > implementations (or, if it can, there will be an unsuspected conversion > cost at the edge). > > It also creates a risk of introducing a parallel Arrow-like ecosystem > based on the superset of data layouts understood by Arrow C++. People > may feel encouraged to code for that ecosystem, pessimizing > interoperability with non-C++ runtimes. > > Which is why I think those APIs, however convenient, also go against the > overarching goals of the Arrow project. > > > If we want to keep such convenience APIs as part of Arrow C++, they > should be clearly flagged as being non-Arrow compliant. > > It could be by naming (e.g. `arrow::non_arrow_string_view()`) or by > specific namespacing (e.g. `non_arrow::raw_pointers_string_view()`). > > But, they could be also be provided by a distinct library. > > Regards > > Antoine. > > > > Le 28/09/2023 à 09:01, Antoine Pitrou a écrit : > > > > Hi Ben, > > > > Le 27/09/2023 à 23:25, Benjamin Kietzman a écrit : > >> > >> @Antoine > >>> What this PR is creating is an "unofficial" Arrow format, with data > >> types exposed in Arrow C++ that are not part of the Arrow standard, but > >> are exposed as if they were. > >> > >> We already do this in every implementation of the arrow format I'm > >> aware of: it's more convenient to consider dictionary as a data type > >> even though the spec says that it is a field property. > > > > I'm not sure I understand your point. Dictionary encoding is part of the > > Arrow spec, and considering it as a data type is an API choice that does > > not violate the spec. > > > > Raw pointers in string views is just not an Arrow format. > > > > Regards > > > > Antoine. >