Re: [DISCUSS][C++] Raw pointer string views

2023-10-07 Thread Andrew Lamb
Given the discussion on this thread, I think the best thing we could do is
1. Do not change the Arrow spec / C++ implementation (do not add raw
pointers)
2. Abandon the goal of "truly zero copy" interchange with Velox and DuckDB
as unobtainable
3. Focus our efforts as a community to drive the new additions to Arrow
through its various ecosystems.

My rationale is that converting DuckDB/Velox style strings to the new Arrow
string view via pointer swizzling is almost certainly less costly than also
copying all the string data as required with the original String Array. So
while the conversion is not "zero" copy it would be "zero string data copy"
which might be good enough, and is certainly better than forcing most users
to convert to the original String Array, as required until the new String
View is implemented across the ecosystem.
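
To make the "zero string data copy" idea concrete, here is a minimal sketch
of pointer swizzling. The struct layouts and the names PointerView,
IndexView, Buffer, Swizzle and Read are hypothetical and simplified for
illustration; they are not the actual Arrow, DuckDB or Velox types. The sketch
assumes the producer hands over the buffers that its pointers refer to, so
only the fixed-size views are rewritten and the out-of-line string bytes stay
where they are.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified layouts for illustration only. Real views pack
// into 16 bytes by overlapping the inline bytes with the pointer or the
// (buffer index, offset) pair; the fields are kept separate here for clarity.
constexpr std::uint32_t kInlineSize = 12;

struct PointerView {               // engine-style view
  std::uint32_t length = 0;
  char inlined[kInlineSize] = {};  // valid when length <= kInlineSize
  const char* data = nullptr;      // valid when length > kInlineSize
};

struct IndexView {                 // interchange-style view
  std::uint32_t length = 0;
  char inlined[kInlineSize] = {};  // valid when length <= kInlineSize
  std::uint32_t buffer_index = 0;  // valid when length > kInlineSize
  std::uint32_t offset = 0;        // valid when length > kInlineSize
};

struct Buffer {
  const char* base;
  std::size_t size;
};

// Rewrites one view without copying out-of-line string bytes. Returns nullopt
// when the pointer does not fall inside any buffer being handed over; a
// caller would then have to fall back to copying that string.
std::optional<IndexView> Swizzle(const PointerView& in,
                                 const std::vector<Buffer>& buffers) {
  IndexView out;
  out.length = in.length;
  if (in.length <= kInlineSize) {
    std::memcpy(out.inlined, in.inlined, in.length);
    return out;
  }
  const auto p = reinterpret_cast<std::uintptr_t>(in.data);
  for (std::size_t i = 0; i < buffers.size(); ++i) {
    const auto lo = reinterpret_cast<std::uintptr_t>(buffers[i].base);
    if (p < lo) continue;
    const std::uintptr_t off = p - lo;
    if (off > UINT32_MAX || off + in.length > buffers[i].size) continue;
    out.buffer_index = static_cast<std::uint32_t>(i);
    out.offset = static_cast<std::uint32_t>(off);
    return out;
  }
  return std::nullopt;
}

// Resolves an index-based view back to its bytes.
std::string Read(const IndexView& v, const std::vector<Buffer>& buffers) {
  if (v.length <= kInlineSize) return std::string(v.inlined, v.length);
  assert(v.buffer_index < buffers.size());
  return std::string(buffers[v.buffer_index].base + v.offset, v.length);
}
```

The per-string work is a small fixed-size rewrite regardless of string
length, which is where the saving over copying all string data would come
from. The linear scan over buffers is only for brevity.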

I believe broad support for the newly updated Arrow String view offers far
better performance/effort tradeoff (as a community) than optimizing for the
last bits of conversion, though for certain use cases the tradeoff may be
different.

Andrew

On Fri, Oct 6, 2023 at 12:35 PM Neal Richardson wrote:

> Agreed, it's unfortunately not just a simple tradeoff. We have discussed
> this a bit in [1] and in several other threads around this topic. If we say
> that Arrow is about interchange and not execution, so we shouldn't adopt
> the pointer version that DuckDB uses, that means we're also making
> interchange harder because of the need to convert from your internal format
> to the Arrow format at the boundary. Adding the pointer version to the
> arrow format solves that, but creates costs elsewhere.
>
> IIUC Ben's proposal tried to solve this tension by making it possible for
> two systems to agree to use the pointer version and pass data without
> serialization costs. That comes with its own risks and tradeoffs.
>
> This feels like another case where the "canonical alternative layout"
> discussed in [1] could be a way to formalize this variation and allow it to
> be used but not required in all implementations. One way or another, we
> need to find a way to balance the desire for Arrow to be a universal
> standard with the risk of diluting the standard to accommodate every
> project.
>
> Neal
>
> [1]: https://lists.apache.org/thread/djl9dbd7qmozxtjpfzby40gg23x0o3wo
>
> On Fri, Oct 6, 2023 at 11:47 AM Weston Pace  wrote:
>
> > > I feel the broader question here is what is Arrow's intended use case -
> > > interchange or execution
> >
> > The line between interchange and execution is not always clear.  For
> > example, I think we would like Arrow to be considered as a standard for
> > UDF libraries.
> >
> > On Fri, Oct 6, 2023 at 7:34 AM Mark Raasveldt wrote:
> >
> > > For the index vs pointer question - DuckDB went with pointers as they are
> > > more flexible, and DuckDB was designed to consume data (and strings) from a
> > > wide variety of formats in a wide variety of languages. Pointers allow us
> > > to easily zero-copy from e.g. Python strings, R strings, Arrow strings,
> > > etc. The flip side of pointers is that ownership has to be handled very
> > > carefully. Our vector format is an execution-only format, and never leaves
> > > the internals of the engine. This greatly simplifies ownership as we are in
> > > complete control of what happens inside the engine. For an interchange
> > > format that is intended for handing data between engines, I can see this
> > > being more complicated and verification being more important.
> > >
> > > As for the actual change:
> > >
> > > From an interchange perspective from DuckDB's side - the proposed
> > > zero-copy integration would definitely speed up the conversion of DuckDB
> > > string vectors to Arrow string vectors. In a recent benchmark that we have
> > > performed we have found string conversion to Arrow vectors to be a
> > > bottleneck in certain workloads, although we have not sufficiently
> > > researched if this could be improved in other ways. It is possible this can
> > > be alleviated without requiring changes to Arrow.
> > >
> > > However - in general, a new string vector format is only useful if
> > > consumers also support the format. If the consumer immediately converts the
> > > strings back into the standard Arrow string representation then there is no
> > > benefit. The change will only move where the conversion happens (from
> > > inside DuckDB to inside the consumer). As such, this change is only useful
> > > if the broader Arrow ecosystem moves towards supporting the new string
> > > format.
> > >
> > > From an execution perspective from DuckDB's side - it is unlikely that we
> > > will switch to using Arrow as an internal format at this stage of the
> > > project. While this change increases Arrow's utility as an intermediate
> > > execution format, that is more relevant to projects that are currently
> > > using Arrow in this manner or are planning 

Re: [Vote][Format] (new proposal) C data interface format string for ListView and LargeListView arrays

2023-10-07 Thread Jonathan Keane
+1

-Jon


On Sat, Oct 7, 2023 at 3:54 AM Joris Van den Bossche wrote:

> +1
>
> On Sat, 7 Oct 2023 at 10:44, Antoine Pitrou  wrote:
> >
> >
> > +1 from me.
> >
> > But I also reiterate my plea that these existing parsers get fixed so as
> > to entirely validate the format string instead of stopping early.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > > On 06/10/2023 at 23:26, Felipe Oliveira Carvalho wrote:
> > > > Hello,
> > > >
> > > > I'm writing to propose "+vl" and "+vL" as format strings for list-view and
> > > > large list-view arrays passing through the Arrow C data interface [1].
> > > >
> > > > The previous proposal was considered a bad idea because existing parsers of
> > > > these format strings might be looking at only the first `l` (or `L`) after
> > > > the `+` and assuming the classic list format from that alone, so now I'm
> > > > proposing we start with a `+v` as this prefix is not shared with any other
> > > > existing type so far.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 - I'm in favor of this new C Data Format string
> > > [ ] +0
> > > [ ] -1 - I'm against adding this new format string because
> > >
> > > Thanks everyone!
> > >
> > > --
> > > Felipe
> > >
> > > [1] https://arrow.apache.org/docs/format/CDataInterface.html
> > >
>


Re: [Vote][Format] (new proposal) C data interface format string for ListView and LargeListView arrays

2023-10-07 Thread Joris Van den Bossche
+1

On Sat, 7 Oct 2023 at 10:44, Antoine Pitrou  wrote:
>
>
> +1 from me.
>
> But I also reiterate my plea that these existing parsers get fixed so as
> to entirely validate the format string instead of stopping early.
>
> Regards
>
> Antoine.
>
>
> On 06/10/2023 at 23:26, Felipe Oliveira Carvalho wrote:
> > Hello,
> >
> > I'm writing to propose "+vl" and "+vL" as format strings for list-view and
> > large list-view arrays passing through the Arrow C data interface [1].
> >
> > The previous proposal was considered a bad idea because existing parsers of
> > these format strings might be looking at only the first `l` (or `L`) after
> > the `+` and assuming the classic list format from that alone, so now I'm
> > proposing we start with a `+v` as this prefix is not shared with any other
> > existing type so far.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 - I'm in favor of this new C Data Format string
> > [ ] +0
> > [ ] -1 - I'm against adding this new format string because
> >
> > Thanks everyone!
> >
> > --
> > Felipe
> >
> > [1] https://arrow.apache.org/docs/format/CDataInterface.html
> >


Re: [Vote][Format] (new proposal) C data interface format string for ListView and LargeListView arrays

2023-10-07 Thread Antoine Pitrou



+1 from me.

But I also reiterate my plea that these existing parsers get fixed so as 
to entirely validate the format string instead of stopping early.
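
As a rough illustration of the difference, here is a minimal sketch
contrasting a parser that stops early with one that validates the whole
format string. The enum Nested and the functions ParseLenient and ParseStrict
are hypothetical names, not taken from any Arrow implementation, and only the
list-like formats are covered. "+lx" is a made-up string used purely to show
the failure mode.

```cpp
#include <string_view>

enum class Nested { kList, kLargeList, kListView, kLargeListView, kUnknown };

// Lenient parser of the kind the proposal worries about: it inspects only
// the first character after '+' and ignores the rest of the string.
constexpr Nested ParseLenient(std::string_view f) {
  if (f.size() < 2 || f[0] != '+') return Nested::kUnknown;
  switch (f[1]) {
    case 'l': return Nested::kList;
    case 'L': return Nested::kLargeList;
    default:  return Nested::kUnknown;
  }
}

// Strict parser: the whole string must match a known format exactly.
constexpr Nested ParseStrict(std::string_view f) {
  if (f == "+l") return Nested::kList;
  if (f == "+L") return Nested::kLargeList;
  if (f == "+vl") return Nested::kListView;
  if (f == "+vL") return Nested::kLargeListView;
  return Nested::kUnknown;
}

// A string that merely starts with "+l" is silently read as a classic list
// by the lenient parser, while the strict parser rejects it.
static_assert(ParseLenient("+lx") == Nested::kList);
static_assert(ParseStrict("+lx") == Nested::kUnknown);

// The "+v" prefix is rejected by a lenient parser that does not know it and
// recognized by a strict parser that does.
static_assert(ParseLenient("+vl") == Nested::kUnknown);
static_assert(ParseStrict("+vl") == Nested::kListView);
```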


Regards

Antoine.


On 06/10/2023 at 23:26, Felipe Oliveira Carvalho wrote:
> Hello,
>
> I'm writing to propose "+vl" and "+vL" as format strings for list-view and
> large list-view arrays passing through the Arrow C data interface [1].
>
> The previous proposal was considered a bad idea because existing parsers of
> these format strings might be looking at only the first `l` (or `L`) after
> the `+` and assuming the classic list format from that alone, so now I'm
> proposing we start with a `+v` as this prefix is not shared with any other
> existing type so far.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 - I'm in favor of this new C Data Format string
> [ ] +0
> [ ] -1 - I'm against adding this new format string because
>
> Thanks everyone!
>
> --
> Felipe
>
> [1] https://arrow.apache.org/docs/format/CDataInterface.html


