To chime in from the Go side: Go memory is typically zeroed automatically on allocation anyway, so by default the Go implementation will also satisfy this invariant, with all of the buffers zeroed out.
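As a rough illustration (a minimal sketch using a hypothetical viewHeader struct, not the actual arrow-go types), a freshly allocated slice of 16-byte view headers in Go is already a well-defined array of empty views: every slot has length 0, buffer index 0 and offset 0, so a null slot never references data that doesn't exist.

    package main

    import "fmt"

    // viewHeader is a hypothetical stand-in for the proposal's 16-byte view:
    // a 4-byte length, then either the string inlined (<= 12 bytes) or a
    // 4-byte prefix plus buffer index and offset for longer strings. Only the
    // long-string form is modeled here.
    type viewHeader struct {
        length    int32
        prefix    [4]byte
        bufferIdx int32
        offset    int32
    }

    func main() {
        // make() zeroes the backing memory, so every slot starts out as a
        // well-defined view: length 0, buffer 0, offset 0 -- an empty string
        // that references no bytes at all.
        views := make([]viewHeader, 4)

        // Populate only slot 1; slots 0, 2 and 3 remain "null" but still valid.
        views[1] = viewHeader{
            length:    5,
            prefix:    [4]byte{'h', 'e', 'l', 'l'},
            bufferIdx: 0,
            offset:    0,
        }

        for i, v := range views {
            fmt.Printf("slot %d: length=%d buffer=%d offset=%d\n", i, v.length, v.bufferIdx, v.offset)
        }
    }

The same holds for validity bitmaps and character buffers allocated through the Go allocator, so no extra initialization pass is needed to uphold the invariant discussed below.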
On Mon, Jul 31, 2023 at 5:18 PM Pedro Eugenio Rocha Pedreira <pedro...@meta.com.invalid> wrote:

> > My only major comment so far is that views for null slots I think should be
> > well defined, that is they shouldn't reference buffers or data that
> > doesn't exist. This is very important for the Rust implementation to be
> > able to provide efficient safe APIs.
>
> Quickly chiming in here again: FWIW, we had a similar discussion for Velox, and
> eventually settled on the invariant that elements should always be copiable,
> in order to make the codebase simpler and less error prone. We keep this
> invariant throughout the engine. This essentially means that Utf8View/BinaryView
> elements are always correctly initialized by zeroing out their lengths, and that
> complex types have sane offsets/lengths values (related to the parallel thread
> on ListViews).
>
> Best,
> --
> Pedro Pedreira
> ________________________________
> From: Raphael Taylor-Davies <r.taylordav...@googlemail.com.INVALID>
> Sent: Monday, July 31, 2023 12:50 AM
> To: dev@arrow.apache.org <dev@arrow.apache.org>
> Subject: Re: [DISCUSS][Format] Draft implementation of string view array format
>
> Hi All,
>
> Having played around with various alternatives, I think we should move
> ahead with this proposal. I think Utf8View and BinaryView would be a
> valuable addition to the arrow specification, and would address many of
> the current pain points encountered when dealing with large string
> payloads or string dictionaries. In particular, they avoid the need to
> copy string data in all bar exceptional circumstances, whilst also
> supporting string interning in a manner that doesn't require recomputing
> dictionaries when combining data streams. Thank you all for the
> insightful discussion.
>
> In that vein I have started implementing them within arrow-rs [1]. My
> only major comment so far is that views for null slots I think should be
> well defined, that is they shouldn't reference buffers or data that
> doesn't exist. This is very important for the Rust implementation to be
> able to provide efficient safe APIs.
>
> Otherwise I don't see any major blockers towards getting these
> standardised, thank you for your continued efforts in this space.
>
> Kind Regards,
>
> Raphael Taylor-Davies
>
> [1]: https://github.com/apache/arrow-rs/pull/4585
>
> On 12/07/2023 17:35, Pedro Eugenio Rocha Pedreira wrote:
> > Hi all, this is Pedro from the Velox team at Meta. Chiming in here to
> > add a bit more context from our side.
> >
> >> I'm not sure the problem here is a lack of understanding or maturity. In
> >> fact, it would be much easier if this was just a problem of education but
> >> it is not.
> >
> > Adding to what Weston said, when Velox started the intent was to have it
> > built on top of vanilla Arrow, if not directly using the Arrow C++ library,
> > at least having an implementation conforming to the data layout. The fact
> > we decided to "extend" the format into what we internally call Velox
> > Vectors was very much a deliberate decision. If Arrow had at the time
> > support for StringView, ListViews (discussed recently in a separate thread,
> > related to supporting out-of-order writes), and more encodings
> > (specifically RLE and Constant, which can now be done through the new REE),
> > Velox would have used it from the beginning.
> > The rationale behind these extensions has been discussed here, but there is
> > more context in our paper last year [0] (check Section 4.2).
> >
> >> I can't speak for all query engines, but at least in the case of
> >> DataFusion we exclusively use the Arrow format as the interchange format
> >> between operators, including for UDFs. We have found that for most
> >> operators operating directly on the Arrow format is sufficiently
> >> performant to not represent a query bottleneck.
> >
> > This always comes down to your workloads and operations you are
> > optimizing for. We found that these extensions were crucial for our
> > workloads, particularly for efficient vectorized expression evaluation.
> > Other modern engines like DuckDB and Umbra had comparable findings and
> > deviate from Arrow in similar ways.
> >
> >> Is Arrow meant to only be used in between systems (in this case query
> >> engines) or is it also meant to be used in between components of a query
> >> engine?
> >
> > Our findings were that if you really care about performance, Arrow today
> > is (unfortunately) not sufficient to support the needs of a
> > state-of-the-art execution engine. Raphael raised a good point above about
> > Arrow potentially favoring interoperability over performance. In that case,
> > I believe Arrow's usage would be restricted to communication between
> > systems, which is about what we see in practice today.
> >
> > However, even between system boundaries this is becoming a hurdle. A
> > recent example is a new project called Gluten [1], which integrates Velox
> > into Spark (akin to Databrick's Photon). The JNI data communication between
> > Velox and Java is done using the Arrow C ABI. We found that the
> > incompatibilities that result in non-zero-copy transfer (back to
> > StringViews) show up high enough as a bottleneck that the team is
> > considering using Velox Vectors as the exchange format to go around this
> > limitation. We argued with the Gluten team to stick with Arrow with the
> > hope we would work with the community to get these extensions incorporated
> > into the standard. Another example is a bridge between Velox and DuckDB
> > built inside Velox that converts these formats directly from one another.
> > We don't use Arrow there for similar reasons. There are other similar
> > examples between PyTorch's pythonic ecosystem and Velox.
> >
> >> I therefore also wonder about the possibility of always having a single
> >> backing buffer that stores the character data, including potentially a copy
> >> of the prefix.
> >
> > There are quite a few use cases for this within Velox and other modern
> > engines, but AFAIK they boil down to two reasons:
> >
> > 1. When you're creating string buffers you usually don't know the size
> > of each string beforehand. This allows us to allocate the buffer
> > page-by-page as you go, then capturing these buffers without having to
> > reallocate and copy them. The simplest use case here is a vectorized
> > function that generates a string as output.
> > 2. It allows us to create StringView arrays without copying the
> > underlying data. Use cases here are the receiving end of an exchange
> > (shuffle), or a simple table scan. In the former, you can capture the pages
> > you directly read from the socket without copying them (we internally use a
> > format called IOBuf from folly, which is sort of a linked list of buffers).
> > In the latter, the pages are usually cached in a memory cache, so you can
> > simply acquire pointers to these page caches. All of that to prevent
> > unnecessary copies.
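To make that zero-copy construction concrete, here is a minimal, self-contained Go sketch (again with hypothetical types, not the arrow-go or Velox APIs) of building a StringView-style array that references externally owned pages as its character buffers; only the 16-byte views themselves are written, none of the page bytes are copied.

    package main

    import "fmt"

    type viewHeader struct {
        length    int32
        prefix    [4]byte
        bufferIdx int32
        offset    int32
    }

    type stringViewArray struct {
        views   []viewHeader
        buffers [][]byte // captured as-is; no bytes are copied into the array
    }

    // appendUncopied records a value that already lives in buffers[buf] at off.
    func (a *stringViewArray) appendUncopied(buf, off, length int) {
        v := viewHeader{length: int32(length), bufferIdx: int32(buf), offset: int32(off)}
        n := length
        if n > 4 {
            n = 4
        }
        copy(v.prefix[:], a.buffers[buf][off:off+n])
        a.views = append(a.views, v)
    }

    func main() {
        // Pretend these pages came straight off a socket or out of a page cache.
        page0 := []byte("foo plus whatever else arrived in this page")
        page1 := []byte("a much longer string living in a second page")

        arr := stringViewArray{buffers: [][]byte{page0, page1}}
        arr.appendUncopied(0, 0, 3)  // "foo"
        arr.appendUncopied(1, 2, 11) // "much longer"

        for _, v := range arr.views {
            data := arr.buffers[v.bufferIdx]
            fmt.Printf("%q\n", data[v.offset:v.offset+v.length])
        }
    }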
> > [0] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
> > [1] - https://github.com/oap-project/gluten
> >
> > Best,
> >
> > --
> > Pedro Pedreira
> > ________________________________
> > From: Weston Pace <weston.p...@gmail.com>
> > Sent: Tuesday, July 11, 2023 8:42 AM
> > To: dev@arrow.apache.org <dev@arrow.apache.org>
> > Subject: Re: [DISCUSS][Format] Draft implementation of string view array format
> >
> >> I definitely hope that with time Arrow will penetrate deeper into these
> >> engines, perhaps in a similar manner to DataFusion, as opposed to
> >> primarily existing at the surface-level.
> >
> > I'm not sure the problem here is a lack of understanding or maturity. In
> > fact, it would be much easier if this was just a problem of education but
> > it is not. Velox is already using Arrow formats (internally, not just at
> > the boundaries) throughout their system. Same with DuckDb. Both teams are
> > very familiar with Arrow. They have just made the decision that the string
> > format is not the correct format to use in a query engine for string data.
> >
> > I don't think there is a single best layout for all situations. I don't
> > think we want to be in the game of declaring which layout is the correct
> > layout for query engines. I suspect there will be cases where the
> > appropriate layout is highly dependent on the workload.
> >
> >> I just wonder if inclusion in the primary
> >> standard is really the right place for them. Perhaps some extension
> >> mechanism might be the way to go here, potentially with some negotiation
> >> mechanism, I'm not really sure
> >> I agree 100% that this sort of interoperability is what makes Arrow so
> >> compelling and something we should work very hard to preserve. This is
> >> the crux of my concern with standardising alternative layouts
> >
> > I did have a proposal for alternative layouts when we were discussing the
> > array view layout. I'll repeat it here for completeness:
> >
> > * There are one or more primary layouts
> >   * Existing layouts are automatically considered primary layouts, even if
> >     they wouldn't have been primary layouts initially (e.g. large list)
> > * A new layout, if it is semantically equivalent to another, is considered
> >   an alternative layout
> > * An alternative layout still has the same requirements for adoption (two
> >   implementations and a vote)
> >   * An implementation should not feel pressured to rush and implement the
> >     new layout. It would be good if they contribute in the discussion and
> >     consider the layout and vote if they feel it would be an acceptable design.
> > * We can define and vote and approve as many canonical alternative > layouts > > as we want: > > * A canonical alternative layout should, at a minimum, have some > > reasonable justification, such as improved performance for > algorithm X > > * Arrow implementations MUST support the primary layouts > > * An Arrow implementation MAY support a canonical alternative, however: > > * An Arrow implementation MUST first support the primary layout > > * An Arrow implementation MUST support conversion to/from the primary > > and canonical layout > > * An Arrow implementation's APIs MUST only provide data in the > > alternative > > layout if it is explicitly asked for (e.g. schema inference should > > prefer the primary layout). > > * We can still vote for new primary layouts (e.g. promoting a canonical > > alternative) but, in these > > votes we don't only consider the value (e.g. performance) of the > layout > > but also the interoperability. > > In other words, a layout can only become a primary layout if there > is > > significant evidence that most > > implementations plan to adopt it. > > > > On Mon, Jul 10, 2023 at 9:49 AM Raphael Taylor-Davies > > <r.taylordav...@googlemail.com.invalid> wrote: > > > >>> For example, if someone (datafusion, velox, etc.) were to come up with > a > >>> framework for UDFs then would batches be passed in and out of those > UDFs > >> in > >>> the Arrow format? > >> Yes, I think the arrow format is a perfect fit for this > >>> Is Arrow meant to only be used in between systems (in this case query > >>> engines) or is it also meant to be used in between components of a > query > >>> engine? > >> I can't speak for all query engines, but at least in the case of > >> DataFusion we exclusively use the Arrow format as the interchange format > >> between operators, including for UDFs. We have found that for most > >> operators operating directly on the Arrow format is sufficiently > >> performant to not represent a query bottleneck. For others, such as > >> joins, sorts and aggregates, we do make use of bespoke data structures > >> and formats internally, e.g. hash tables, row formats, etc..., but the > >> operator's public APIs are still in terms of arrow RecordBatch. We have > >> found this approach to perform very well, whilst also providing very > >> good modularity and composability. > >> > >> In fact we are actually currently in the process of migrating the > >> aggregation logic away from a bespoke mutable row representation to the > >> Arrow model, and are already seeing significant performance > >> improvements, not to mention a significant reduction in code complexity > >> and improved composability [1]. > >> > >>> If every engine has its own bespoke formats internally > >>> then it seems we are placing a limit on how far things can be > decomposed. > >> Agreed, if engines choose to implement operations on bespoke formats, > >> these operations will likely not be as interoperable as those > >> implemented using Arrow. To what extent an engine favours their own > >> format(s) over Arrow will be an engineering trade-off they will have to > >> make, but DataFusion has found exclusively using Arrow as the > >> interchange format between operators to work well. > >> > >>> There are now multiple implementations of a query > >>> engine and I think we are seeing just the edges of this query engine > >>> decomposition (e.g. 
using arrow-c++'s datasets to feed DuckDb or > >> consuming > >>> a velox task as a record batch stream into a different system) and > these > >>> sorts of challenges are in the forefront. > >> I agree 100% that this sort of interoperability is what makes Arrow so > >> compelling and something we should work very hard to preserve. This is > >> the crux of my concern with standardising alternative layouts. I > >> definitely hope that with time Arrow will penetrate deeper into these > >> engines, perhaps in a similar manner to DataFusion, as opposed to > >> primarily existing at the surface-level. > >> > >> [1]: https://github.com/apache/arrow-datafusion/pull/6800 > >> > >> On 10/07/2023 11:38, Weston Pace wrote: > >>>> The point I was trying to make, albeit very badly, was that these > >>>> operations are typically implemented using some sort of row format [1] > >>>> [2], and therefore their performance is not impacted by the array > >>>> representations. I think it is both inevitable, and in fact something > to > >>>> be encouraged, that query engines will implement their own in-memory > >>>> layouts and data structures outside of the arrow specification for > >>>> specific operators, workloads, hardware, etc... This allows them to > make > >>>> trade-offs based on their specific application domain, whilst also > >>>> ensuring that new ideas and approaches can continue to be incorporated > >>>> and adopted in the broader ecosystem. However, to then seek to > >>>> standardise these layouts seems to be both potentially unbounded scope > >>>> creep, and also somewhat counter productive if the goal of > >>>> standardisation is improved interoperability? > >>> FWIW, I believe this formats are very friendly for row representation > as > >>> well, especially when stored as a payload (e.g. in a join). > >>> > >>> For your more general point though I will ask the same question I asked > >> on > >>> the ArrayView discussion: > >>> > >>> Is Arrow meant to only be used in between systems (in this case query > >>> engines) or is it also meant to be used in between components of a > query > >>> engine? > >>> > >>> For example, if someone (datafusion, velox, etc.) were to come up with > a > >>> framework for UDFs then would batches be passed in and out of those > UDFs > >> in > >>> the Arrow format? If every engine has its own bespoke formats > internally > >>> then it seems we are placing a limit on how far things can be > decomposed. > >>> From the C++ perspective, I would personally like to see Arrow be > usable > >>> within components. There are now multiple implementations of a query > >>> engine and I think we are seeing just the edges of this query engine > >>> decomposition (e.g. using arrow-c++'s datasets to feed DuckDb or > >> consuming > >>> a velox task as a record batch stream into a different system) and > these > >>> sorts of challenges are in the forefront. > >>> > >>> On Fri, Jul 7, 2023 at 7:53 AM Raphael Taylor-Davies > >>> <r.taylordav...@googlemail.com.invalid> wrote: > >>> > >>>>> Thus the approach you > >>>>> describe for validating an entire character buffer as UTF-8 then > >> checking > >>>>> offsets will be just as valid for Utf8View arrays as for Utf8 arrays. > >>>> The difference here is that it is perhaps expected for Utf8View to > have > >>>> gaps in the underlying data that are not referenced as part of any > >>>> value, as I had understood this to be one of its benefits over the > >>>> current encoding. 
I think it would therefore be problematic to enforce > >>>> these gaps be UTF-8. > >>>> > >>>>> Furthermore unlike an explicit > >>>>> selection vector a kernel may decide to copy and densify dynamically > if > >>>> it > >>>>> detects that output is getting sparse or fragmented > >>>> I don't see why you couldn't do something similar to materialize a > >>>> sparse selection vector, if anything being able to centralise this > logic > >>>> outside specific kernels would be advantageous. > >>>> > >>>>> Specifically sorting and equality comparison > >>>>> benefit significantly from the prefix comparison fast path, > >>>>> so I'd anticipate that multi column sorting and aggregations would as > >>>> well > >>>> > >>>> The point I was trying to make, albeit very badly, was that these > >>>> operations are typically implemented using some sort of row format [1] > >>>> [2], and therefore their performance is not impacted by the array > >>>> representations. I think it is both inevitable, and in fact something > to > >>>> be encouraged, that query engines will implement their own in-memory > >>>> layouts and data structures outside of the arrow specification for > >>>> specific operators, workloads, hardware, etc... This allows them to > make > >>>> trade-offs based on their specific application domain, whilst also > >>>> ensuring that new ideas and approaches can continue to be incorporated > >>>> and adopted in the broader ecosystem. However, to then seek to > >>>> standardise these layouts seems to be both potentially unbounded scope > >>>> creep, and also somewhat counter productive if the goal of > >>>> standardisation is improved interoperability? I fully expect in the > next > >>>> 5 years someone will come up with an even better way to encode strings > >>>> for some particular workload or hardware, do we then incorporate that > as > >>>> well? > >>>> > >>>> I guess it boils down to what matters to people more, interoperability > >>>> or best-in-class performance? Currently I think it is fair to say both > >>>> arrow and parquet favour interoperability over performance, aiming to > >>>> provide good enough performance broadly on the same order of magnitude > >>>> as a custom solution. I personally think this is the right engineering > >>>> trade-off, but appreciate opinions may differ. Ultimately I just > really > >>>> want arrow to avoid the situation parquet has found itself in, where > the > >>>> specification has both far outstripped the ability for the > >>>> implementations to keep pace, whilst simultaneously having > standardised > >>>> approaches for things like delta encoding that are now considered > >>>> extremely sub-optimal for modern hardware. > >>>> > >>>> That all being said I'm not against adding support for these arrays if > >>>> others are already onboard, I just wonder if inclusion in the primary > >>>> standard is really the right place for them. Perhaps some extension > >>>> mechanism might be the way to go here, potentially with some > negotiation > >>>> mechanism, I'm not really sure... I will continue to think on this > >>>> > >>>> Kind Regards, > >>>> > >>>> Raphael > >>>> > >>>> [1]: > >>>> > >>>> > >> > https://duckdb.org/2021/08/27/external-sorting.html#binary-string-comparison > >>>> [2]: https://docs.rs/arrow-row/latest/arrow_row/ > >>>> > >>>> On 06/07/2023 17:47, Benjamin Kietzman wrote: > >>>>> @Andrew: > >>>>> > >>>>> Restricting these arrays to a single buffer will severely decrease > >> their > >>>>> utility. 
Since the character data is stored in multiple character > >> buffers > >>>>> writing Utf8View array can proceed without resizing allocations, > >>>>> which is a major overhead when writing Utf8 arrays. Furthermore since > >> the > >>>>> character buffers have no restrictions on their size, it's > >>>> straightforward > >>>>> to > >>>>> reuse an existing buffer as a character buffer rather than always > >>>> allocating > >>>>> a new one. In the case of creating an array which shares a lot of > data > >>>> with > >>>>> another (for example, appending some strings) we can reuse most of > the > >>>>> character buffers from the original. Finally Utf8View is well adapted > >> for > >>>>> efficiently wrapping non-arrow string data for ingestion by a kernel, > >>>> even > >>>>> if the string data's full extent is not known ahead of time and is > >> spread > >>>>> across multiple non-contiguous buffers. > >>>>> > >>>>> @Raphael: > >>>>> > >>>>>> branch on access > >>>>> The branch-on-access is unavoidable since a primary feature of the > >>>> Utf8View > >>>>> format is keeping short strings inline in the fixed width portion of > >>>> data. > >>>>> It's worth noting that the inline prefix allows skipping the branch > >>>> entirely > >>>>> for common cases of comparison, for example when the strings to be > >>>> compared > >>>>> differ within the first 4 bytes. > >>>>> > >>>>> In benchmarking (for example while building a hash table) I have not > >>>>> observed > >>>>> that this branch overly pessimizes access. Although I can't guarantee > >>>> every > >>>>> Utf8View array will be more efficient than any Utf8 array, it is > >>>> certainly > >>>>> faster for many relevant cases. Specifically sorting and equality > >>>> comparison > >>>>> benefit significantly from the prefix comparison fast path, > >>>>> so I'd anticipate that multi column sorting and aggregations would as > >>>> well. > >>>>> If there are any other benchmarks which would help to justify > Utf8View > >> in > >>>>> your > >>>>> mind, I'd be happy to try writing them. > >>>>> > >>>>>> UTF-8 validation for StringArray can be done very efficiently by > first > >>>>> verifying the entire buffer, and then verifying the offsets > correspond > >> to > >>>>> the start of a UTF-8 codepoint > >>>>> > >>>>> For non-inlined strings, the character buffers do always contain the > >>>> entire > >>>>> string's data and not just the last `len - 4` bytes. Thus the > approach > >>>> you > >>>>> describe for validating an entire character buffer as UTF-8 then > >> checking > >>>>> offsets will be just as valid for Utf8View arrays as for Utf8 arrays. > >>>>> > >>>>>> it does seem inconsistent to use unsigned types > >>>>> It is indeed more typical for the arrow format to use signed integers > >> for > >>>>> offsets and other quantities. In this case there is prior art in > other > >>>>> engines with which we can remain compatible by using unsigned > integers > >>>>> instead. Since this is only a break with convention within the format > >> and > >>>>> shouldn't be difficult for any implementation to accommodate, I would > >>>> argue > >>>>> that it's worthwhile to avoid pushing change onto existing > >> implementers. > >>>>>> I presume that StringView will behave similarly to dictionaries in > >> that > >>>>> the selection kernels will not recompute the underlying value > buffers. 
> >>>>> > >>>>> The Utf8View format itself is not prescriptive of selection > operations > >> on > >>>>> the > >>>>> array; kernels are free to reuse character buffers (which produces an > >>>>> implicit > >>>>> selection vector) or to recompute them. Furthermore unlike an > explicit > >>>>> selection vector a kernel may decide to copy and densify dynamically > if > >>>> it > >>>>> detects that output is getting sparse or fragmented. It's also worth > >>>> noting > >>>>> that unlike an explicit selection vector a Utf8View array (however > >>>> sparse or > >>>>> fragmented) will still benefit from the prefix comparison fast path. > >>>>> > >>>>> Sincerely, > >>>>> Ben Kietzman > >>>>> > >>>>> On Sun, Jul 2, 2023 at 8:01 AM Raphael Taylor-Davies > >>>>> <r.taylordav...@googlemail.com.invalid> wrote: > >>>>> > >>>>>>> I would be interested in hearing some input from the Rust > community. > >>>>>> A couple of thoughts: > >>>>>> > >>>>>> The variable number of buffers would definitely pose some challenges > >> for > >>>>>> the Rust implementation, the closest thing we currently have is > >> possibly > >>>>>> UnionArray, but even then the number of buffers is still determined > >>>>>> statically by the DataType. I therefore also wonder about the > >>>> possibility > >>>>>> of always having a single backing buffer that stores the character > >> data, > >>>>>> including potentially a copy of the prefix. This would also avoid > >>>> forcing a > >>>>>> branch on access, which I would have expected to hurt performance > for > >>>> some > >>>>>> kernels quite significantly. > >>>>>> > >>>>>> Whilst not really a concern for Rust, which supports unsigned types, > >> it > >>>>>> does seem inconsistent to use unsigned types where the rest of the > >>>> format > >>>>>> encourages the use of signed offsets, etc... > >>>>>> > >>>>>> It isn't clearly specified whether a null should have a valid set of > >>>>>> offsets, etc... I think it is an important property of the current > >> array > >>>>>> layouts that, with exception to dictionaries, the data in null slots > >> is > >>>>>> arbitrary, i.e. can take any value, but not undefined. This allows > for > >>>>>> separate handling of the null mask and values, which can be > important > >>>> for > >>>>>> some kernels and APIs. > >>>>>> > >>>>>> More an observation than an issue, but UTF-8 validation for > >> StringArray > >>>>>> can be done very efficiently by first verifying the entire buffer, > and > >>>> then > >>>>>> verifying the offsets correspond to the start of a UTF-8 codepoint. > >> This > >>>>>> same approach would not be possible for StringView, which would need > >> to > >>>>>> verify individual values and would therefore be significantly more > >>>>>> expensive. As it is UB for a Rust string to contain non-UTF-8 data, > >> this > >>>>>> validation is perhaps more important for Rust than for other > >> languages. > >>>>>> I presume that StringView will behave similarly to dictionaries in > >> that > >>>>>> the selection kernels will not recompute the underlying value > >> buffers. I > >>>>>> think this is fine, but it is perhaps worth noting this has caused > >>>>>> confusion in the past, as people somewhat reasonably expect an array > >>>>>> post-selection to have memory usage reflecting the smaller > selection. > >>>> This > >>>>>> is then especially noticeable if the data is written out to IPC, and > >>>> still > >>>>>> contains data that was supposedly filtered out. 
My 2 cents is that > >>>> explicit > >>>>>> selection vectors are a less surprising way to defer selection than > >>>> baking > >>>>>> it into the array, but I also don't have any workloads where this is > >> the > >>>>>> major bottleneck so can't speak authoritatively here. > >>>>>> > >>>>>> Which leads on to my major concern with this proposal, that it adds > >>>>>> complexity and cognitive load to the specification and > >> implementations, > >>>>>> whilst not meaningfully improving the performance of the operators > >> that > >>>> I > >>>>>> commonly encounter as performance bottlenecks, which are > multi-column > >>>> sorts > >>>>>> and aggregations, or the expensive string operations such as > matching > >> or > >>>>>> parsing. If we didn't already have a string representation I would > be > >>>> more > >>>>>> onboard, but as it stands I'm definitely on the fence, especially > >> given > >>>>>> selection performance can be improved in less intrusive ways using > >>>>>> dictionaries or selection vectors. > >>>>>> > >>>>>> Kind Regards, > >>>>>> > >>>>>> Raphael Taylor-Davies > >>>>>> > >>>>>> On 02/07/2023 11:46, Andrew Lamb wrote: > >>>>>> > >>>>>> * This is the first layout where the number of buffers depends > on > >> the > >>>>>> data > >>>>>> > >>>>>> and not the schema. I think this is the most architecturally > >> significant > >>>>>> fact. I > >>>>>> > >>>>>> I have spent some time reading the initial proposal -- thank you > >> for > >>>>>> that. I now understand what Weston was saying about the "variable > >>>> numbers > >>>>>> of buffers". I wonder if you considered restricting such arrays to a > >>>> single > >>>>>> buffer (so as to make them more similar to other arrow array types > >> that > >>>>>> have a fixed number of buffers)? On Tue, Jun 20, 2023 at 11:33 AM > >> Weston > >>>>>> Pace<weston.p...@gmail.com> <mailto:weston.p...@gmail.com> wrote: > >>>>>> > >>>>>> Before I say anything else I'll say that I am in favor of this new > >>>> layout. > >>>>>> There is some existing literature on the idea (e.g. umbra) and your > >>>>>> benchmarks show some nice improvements. Compared to some of the > other > >>>>>> layouts we've discussed recently (REE, list veiw) I do think this > >>>> layout is > >>>>>> more unique and fundamentally different. Perhaps most fundamentally > >>>>>> different: * This is the first layout where the number of buffers > >>>> depends > >>>>>> on the data and not the schema. I think this is the most > >> architecturally > >>>>>> significant fact. It does require a (backwards compatible) change to > >> the > >>>>>> IPC format itself, beyond just adding new type codes. It also poses > >>>>>> challenges in places where we've assumed there will be at most 3 > >> buffers > >>>>>> (e.g. in ArraySpan, though, as you have shown, we can work around > this > >>>>>> using a raw pointers representation internally in those spots). I > >> think > >>>>>> you've done some great work to integrate this well with Arrow-C++ > and > >>>> I'm > >>>>>> convinced it can work. I would be interested in hearing some input > >> from > >>>> the > >>>>>> Rust community. Ben, at one point there was some discussion that > this > >>>> might > >>>>>> be a c-data only type. However, I believe that was based on the raw > >>>>>> pointers representation. What you've proposed here, if I understand > >>>>>> correctly, is an index + offsets representation and it is suitable > for > >>>> IPC > >>>>>> correct? (e.g. 
I see that you have changes and examples in the IPC > >>>>>> reader/writer) On Mon, Jun 19, 2023 at 7:17 AM Benjamin Kietzman < > >>>>>> bengil...@gmail.com> <mailto:bengil...@gmail.com> wrote: > >>>>>> > >>>>>> Hi Gang, I'm not sure what you mean, sorry if my answers are off > base: > >>>>>> Parquet's ByteArray will be unaffected by the addition of the string > >>>> view > >>>>>> type; all arrow strings > >> (arrow::Type::STRING,arrow::Type::LARGE_STRING, > >>>>>> and with this patcharrow::Type::STRING_VIEW) are converted to > >> ByteArrays > >>>>>> during serialization to parquet [1]. If you mean that encoding of > >>>>>> arrow::Type::STRING_VIEW will not be as fast as encoding of > >> equivalent > >>>>>> arrow::Type::STRING, that's something I haven't benchmarked so I > can't > >>>>>> answer definitively. I would expect it to be > >>>>>> > >>>>>> faster > >>>>>> > >>>>>> than first converting STRING_VIEW->STRING then encoding to parquet; > >>>> direct > >>>>>> encoding avoids allocating and populating temporary buffers. Of > course > >>>>>> > >>>>>> this > >>>>>> > >>>>>> only applies to cases where you need to encode an array of > STRING_VIEW > >>>> to > >>>>>> parquet- encoding of STRING to parquet will be unaffected. > Sincerely, > >>>> Ben > >>>>>> [1] > >>>>>> > >>>>>> > >>>>>> > >> > https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179 > >>>>>> On Thu, Jun 15, 2023 at 10:34 PM Gang Wu<ust...@gmail.com> > >> <mailto: > >>>> ust...@gmail.com> wrote: > >>>>>> Hi Ben, The posted benchmark [1] looks pretty good to me. However, I > >>>> want > >>>>>> to raise a possible issue from the perspective of parquet-cpp. > >>>> Parquet-cpp > >>>>>> uses a customizedparquet::ByteArray type [2] for string/binary, I > >>>>>> > >>>>>> would > >>>>>> > >>>>>> expect some regression of conversions between parquet reader/writer > >> and > >>>>>> the proposed string view array, especially when some strings use > short > >>>> form > >>>>>> and others use long form. [1] > >>>>>> > >>>>>> > >>>>>> > >> > https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575 > >>>>>> [2] > https://github.com/apache/arrow/pull/35628#issuecomment-1583218617 > >>>>>> Best, Gang On Fri, Jun 16, 2023 at 3:58 AM Will Jones < > >>>>>> will.jones...@gmail.com> <mailto:will.jones...@gmail.com> wrote: > >>>>>> > >>>>>> Cool. Thanks for doing that! On Thu, Jun 15, 2023 at 12:40 Benjamin > >>>>>> Kietzman <bengil...@gmail.com <mailto:bengil...@gmail.com> > >>>>>> > >>>>>> wrote: > >>>>>> > >>>>>> I've addedhttps://github.com/apache/arrow/issues/36112 to track > >>>>>> deduplication of buffers on write. I don't think it would require > >>>>>> modification of the IPC format. Ben On Thu, Jun 15, 2023 at 1:30 PM > >> Matt > >>>>>> Topol <zotthewiz...@gmail.com <mailto:zotthewiz...@gmail.com> > >>>>>> > >>>>>> wrote: > >>>>>> > >>>>>> Based on my understanding, in theory a buffer *could* be shared > >>>>>> > >>>>>> within > >>>>>> > >>>>>> a > >>>>>> > >>>>>> batch since the flatbuffers message just uses an offset and > >>>>>> > >>>>>> length > >>>>>> > >>>>>> to > >>>>>> > >>>>>> identify the buffers. That said, I don't believe any current > >>>>>> implementation actually > >>>>>> > >>>>>> does > >>>>>> > >>>>>> this > >>>>>> > >>>>>> or > >>>>>> > >>>>>> takes advantage of this in any meaningful way. 
--Matt On Thu, Jun > 15, > >>>> 2023 > >>>>>> at 1:00 PM Will Jones < > >>>>>> > >>>>>> will.jones...@gmail.com <mailto:will.jones...@gmail.com>> > >>>>>> > >>>>>> wrote: > >>>>>> > >>>>>> Hi Ben, It's exciting to see this move along. The buffers will be > >>>>>> duplicated. If buffer duplication is > >>>>>> > >>>>>> becomes > >>>>>> > >>>>>> a > >>>>>> > >>>>>> concern, > >>>>>> > >>>>>> I'd prefer to handle that in the ipc writer. Then buffers which are > >>>>>> duplicated > >>>>>> > >>>>>> could > >>>>>> > >>>>>> be > >>>>>> > >>>>>> detected > >>>>>> > >>>>>> by checking pointer identity and written only once. > >>>>>> > >>>>>> Question: to be able to write buffer only once and reference in > >>>>>> > >>>>>> multiple > >>>>>> > >>>>>> arrays, does that require a change to the IPC format? Or is > >>>>>> > >>>>>> sharing > >>>>>> > >>>>>> buffers > >>>>>> > >>>>>> within the same batch already allowed in the IPC format? Best, Will > >>>> Jones > >>>>>> On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman < > >>>>>> > >>>>>> bengil...@gmail.com <mailto:bengil...@gmail.com> > >>>>>> > >>>>>> wrote: > >>>>>> > >>>>>> Hello again all, The PR [1] to add string view to the format and the > >> C++ > >>>>>> implementation > >>>>>> > >>>>>> is > >>>>>> > >>>>>> hovering around passing CI and has been undrafted. > >>>>>> > >>>>>> Furthermore, > >>>>>> > >>>>>> there > >>>>>> > >>>>>> is > >>>>>> > >>>>>> now also a PR [2] to add string view to the Go > >>>>>> > >>>>>> implementation. > >>>>>> > >>>>>> Code > >>>>>> > >>>>>> review > >>>>>> > >>>>>> is underway for each PR and I'd like to move toward a vote > >>>>>> > >>>>>> for > >>>>>> > >>>>>> acceptance- > >>>>>> > >>>>>> are there any other preliminaries which I've neglected? To reiterate > >> the > >>>>>> answers to some past questions: - Benchmarks are added in the C++ PR > >>>> [1] to > >>>>>> demonstrate the > >>>>>> > >>>>>> performance > >>>>>> > >>>>>> of > >>>>>> > >>>>>> conversion between the various string formats. In addition, > >>>>>> > >>>>>> there > >>>>>> > >>>>>> are > >>>>>> > >>>>>> some benchmarks which demonstrate the performance gains > >>>>>> > >>>>>> available > >>>>>> > >>>>>> with > >>>>>> > >>>>>> the new format [3]. - Adding string view to the C ABI is a > natural > >>>> follow > >>>>>> up, but > >>>>>> > >>>>>> should > >>>>>> > >>>>>> be > >>>>>> > >>>>>> handled independently. An issue has been added to track > >>>>>> > >>>>>> that > >>>>>> > >>>>>> enhancement [4]. Sincerely, Ben Kietzman [1] > >>>>>> https://github.com/apache/arrow/pull/35628 [2] > >>>>>> https://github.com/apache/arrow/pull/35769 [3] > >>>>>> > >>>>>> https://github.com/apache/arrow/pull/35628#issuecomment-1583218617 > >>>>>> > >>>>>> [4]https://github.com/apache/arrow/issues/36099 On Wed, May 17, > 2023 > >>>> at > >>>>>> 12:53 PM Benjamin Kietzman < > >>>>>> > >>>>>> bengil...@gmail.com <mailto:bengil...@gmail.com>> > >>>>>> > >>>>>> wrote: > >>>>>> > >>>>>> @Jacob > >>>>>> > >>>>>> You mention benchmarks multiple times, are these results > >>>>>> > >>>>>> published > >>>>>> > >>>>>> somewhere? I benchmarked the performance of raw pointer vs index > >>>>>> > >>>>>> offset > >>>>>> > >>>>>> views > >>>>>> > >>>>>> in > >>>>>> > >>>>>> my > >>>>>> > >>>>>> PR to velox, I do intend to port them to my arrow PR but I haven't > >>>>>> > >>>>>> gotten > >>>>>> > >>>>>> there > >>>>>> > >>>>>> yet. 
> >>>>>> > >>>>>> Furthermore, it seemed less urgent to me since coexistence of the > two > >>>>>> types > >>>>>> > >>>>>> in > >>>>>> > >>>>>> the > >>>>>> > >>>>>> c++ > >>>>>> > >>>>>> implementation defers the question of how aggressively one should be > >>>>>> > >>>>>> preferred > >>>>>> > >>>>>> over > >>>>>> > >>>>>> the > >>>>>> > >>>>>> other. @Dewey > >>>>>> > >>>>>> I don't see the C Data interface in the PR > >>>>>> > >>>>>> I have not addressed the C ABI in this PR. As you mention, > >>>>>> > >>>>>> it > >>>>>> > >>>>>> may > >>>>>> > >>>>>> be > >>>>>> > >>>>>> useful to transmit arrays with raw pointer views between > >> implementations > >>>>>> which > >>>>>> > >>>>>> allow > >>>>>> > >>>>>> them. I > >>>>>> > >>>>>> can address this in a follow up PR. @Will > >>>>>> > >>>>>> If I understand correctly, multiple arrays can reference > >>>>>> > >>>>>> the > >>>>>> > >>>>>> same > >>>>>> > >>>>>> buffers > >>>>>> > >>>>>> in memory, but once they are written to IPC their data > >