@Wes 3. Implement the raw pointer variant as an extension type in C++ / C ABI.
@Andrew 1. Update the standard to allow raw pointers If adding raw pointers to the C ABI is a satisfactory compromise, then I'd be happy to draft a PR adding it. To me this seems to cover the bases of accommodating either priority and of making each representation official in all contexts where it can appear. @Antoine discover that their data cannot be shared with other implementations (or, if it can, there will be an unsuspected conversion cost at the edge). It also creates a risk of introducing a parallel Arrow-like ecosystem based on the superset of data layouts understood by Arrow C++. People may feel encouraged to code for that ecosystem, pessimizing interoperability with non-C++ runtimes. The choice of data type for a column is already one which must be made with attention to the needs of a particular application. To my mind, we might decide for enough flexibility to allow users their choice of tradeoff or we might avert that frustration by choosing a tradeoff for them. The choice between index/offset and raw pointer representation is demonstrably strongly informed by intended application. I think that the risk of a parallel ecosystem (which I agree is to be avoided!) is more likely to be provoked by excluding a user's vital use case. Sincerely, Ben Kietzman On Thu, Sep 28, 2023 at 12:10 PM Andrew Lamb <al...@influxdata.com> wrote: > > What this PR is creating is an "unofficial" Arrow format, with data > types exposed in Arrow C++ that are not part of the Arrow standard, but > are exposed as if they were. > > I agree with Antoine here. It seems a pretty clear cut story of the C++ > implementation doesn't follow the spec and thus we should either > 1. Update the standard to allow raw pointers > 2. fix the C++ implementation to not have them / treat them as though they > were > > If the core usecase is "arrow has the same in memory format used by DuckDB > and Velox, and those systems can't/won't change their implementations" it > seems like the only path forward for that usecase is to adopt their model > (raw pointers) directly. Maybe I am missing something > > > Andrew > > > > > > > On Thu, Sep 28, 2023 at 11:11 AM Raphael Taylor-Davies > <r.taylordav...@googlemail.com.invalid> wrote: > > > FWIW Rust wouldn't have issues using raw pointers, I can't speak for > other > > languages though. They would be more expensive to validate, but > validation > > is not going to be cheap regardless. > > > > I could definitely see a world where view types use pointers and IPC > > coerces to/from the large non-view types. IPC has to copy the string data > > regardless and re-encoding would avoid encoding masked data. > > > > The notion of supporting both is less of an exciting prospect... I'm also > > not sure if it is too late to make changes at this stage. > > > > On 28 September 2023 15:26:57 BST, Wes McKinney <wesmck...@gmail.com> > > wrote: > > >hi all, > > > > > >I'm just catching up on this thread after having taken a look at the > > format > > >PRs, the C++ implementation PR, and this e-mail thread. So only my $0.02 > > >from having spent a great deal less time on this project than others. > > > > > >The original motivation I had for bringing up the idea of adding the > > >StringView concept from DuckDB / Velox / UmbraDB to the Arrow in-memory > > >format (though not necessarily the IPC format) was to provide a path for > > >zero-copy interoperability in some cases with these systems when dealing > > >with strings, and to enhance performance within Arrow-applications > > (setting > > >aside the external interop goal) in scenarios where being able to point > to > > >external memory spaces could avoid a copy-and-repack step. I think it's > > >useful to have an zero-copy IPC-compatible string format (i.e. what was > > >proposed and merged into Columnar.rst) for that allows for out-of-order > > >construction or arrays, reuse of memory (e.g. consider the case of > > decoding > > >dictionary encoding Parquet data — not having to copy strings many times > > >when rehydrating string arrays), and chunked allocation — all good > things > > >that the existing Arrow VarBinary layout does not provide for. > > > > > >For the in-memory side of things, I am somewhat more of Antoine's > > >perspective that trying to have both in-memory (index+offset and raw > > >pointers) creates a kind of uncanny valley situation that may confuse > > users > > >and cause other problems (especially if the raw pointer version is only > > >found in the C++ library). The raw pointer version also cannot be > > >validated, but I see validation as less of a requirement and more of a > > >"nice to have" (I realize others see validation as more of a > requirement). > > > > > >* I see the raw-pointer type has having more net utility (going back to > > the > > >original motivation), but I also see how it is problematic for some > > non-C++ > > >implementations. > > >* The index-offset version is intrinsic value over the existing "dense" > > >varbinary layout (per some of the benefits above) but does not satisfy > the > > >external interoperability goal with systems that are becoming more > popular > > >month over month > > >* Incoming data from external systems that use the raw pointer model > have > > >to be serialized (and perhaps repacked) to the index-offset model. This > > >isn't ideal — going the other way (from index-offset to raw pointer) is > > >just a pointer swizzle, comparatively inexpensive. > > > > > >So it seems like we have several paths available, none of them wholly > > >satisfactory: > > > > > >1. Essentially what's in the existing PR — the raw pointer variant which > > is > > >"non-standard" > > >2. Pick one and only one for in memory — I think the raw pointer version > > is > > >more useful given that swizzling from index-offset is pretty cheap. But > > the > > >raw pointer version can't be validated safely and is problematic for > e.g. > > >Rust. Picking the index-offset version means that the external ecosystem > > of > > >columnar engines won't be that much closer aligned to Arrow than they > are > > >now. > > >3. Implement the raw pointer variant as an extension type in C++ / C > ABI. > > >This seems potentially useful but given that it would likely be > disfavored > > >for data originating from Arrow-land, there would be fewer scenarios > where > > >zero-copy interop for strings is achieved > > > > > >This is difficult and I don't know what the best answer is, but > personally > > >my inclination has been toward choices that are utilitarian and help > with > > >alignment and cohesion in the open source ecosystem. > > > > > >- Wes > > > > > >On Thu, Sep 28, 2023 at 5:20 AM Antoine Pitrou <anto...@python.org> > > wrote: > > > > > >> > > >> To make things clear, any of the factory functions listed below > create a > > >> type that maps exactly onto an Arrow columnar layout: > > >> > > > https://arrow.apache.org/docs/dev/cpp/api/datatype.html#factory-functions > > >> > > >> For example, calling `arrow::dictionary` creates a dictionary type > that > > >> exactly represents the dictionary layout specified in > > >> > > >> > > > https://arrow.apache.org/docs/dev/format/Columnar.html#dictionary-encoded-layout > > >> > > >> Similarly, if you use any of the builders listed below, what you will > > >> get at the end is data that complies with the Arrow columnar > > specification: > > >> https://arrow.apache.org/docs/dev/cpp/api/builder.html > > >> > > >> All the core Arrow C++ APIs create and process data which complies > with > > >> the Arrow specification, and which is interoperable with other Arrow > > >> implementations. > > >> > > >> Conversely, non-Arrow data such as CSV or Parquet (or Python lists, > > >> etc.) goes through dedicated converters. There is no ambiguity. > > >> > > >> > > >> Creating top-level utilities that create non-Arrow data introduces > > >> confusion and ambiguity as to what Arrow is. Users who haven't studied > > >> the spec in detail - which is probably most users of Arrow > > >> implementations - will call `arrow::string_view(raw_pointers=true)` > and > > >> might later discover that their data cannot be shared with other > > >> implementations (or, if it can, there will be an unsuspected > conversion > > >> cost at the edge). > > >> > > >> It also creates a risk of introducing a parallel Arrow-like ecosystem > > >> based on the superset of data layouts understood by Arrow C++. People > > >> may feel encouraged to code for that ecosystem, pessimizing > > >> interoperability with non-C++ runtimes. > > >> > > >> Which is why I think those APIs, however convenient, also go against > the > > >> overarching goals of the Arrow project. > > >> > > >> > > >> If we want to keep such convenience APIs as part of Arrow C++, they > > >> should be clearly flagged as being non-Arrow compliant. > > >> > > >> It could be by naming (e.g. `arrow::non_arrow_string_view()`) or by > > >> specific namespacing (e.g. `non_arrow::raw_pointers_string_view()`). > > >> > > >> But, they could be also be provided by a distinct library. > > >> > > >> Regards > > >> > > >> Antoine. > > >> > > >> > > >> > > >> Le 28/09/2023 à 09:01, Antoine Pitrou a écrit : > > >> > > > >> > Hi Ben, > > >> > > > >> > Le 27/09/2023 à 23:25, Benjamin Kietzman a écrit : > > >> >> > > >> >> @Antoine > > >> >>> What this PR is creating is an "unofficial" Arrow format, with > data > > >> >> types exposed in Arrow C++ that are not part of the Arrow standard, > > but > > >> >> are exposed as if they were. > > >> >> > > >> >> We already do this in every implementation of the arrow format I'm > > >> >> aware of: it's more convenient to consider dictionary as a data > type > > >> >> even though the spec says that it is a field property. > > >> > > > >> > I'm not sure I understand your point. Dictionary encoding is part of > > the > > >> > Arrow spec, and considering it as a data type is an API choice that > > does > > >> > not violate the spec. > > >> > > > >> > Raw pointers in string views is just not an Arrow format. > > >> > > > >> > Regards > > >> > > > >> > Antoine. > > >> > > >