To chime in from the Go side: Go memory is typically zeroed automatically on allocation anyway, so by default the Go implementation will also satisfy this invariant, with all of the buffers zeroed out.
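As a rough illustration (a minimal sketch using a hypothetical viewHeader struct, not the actual arrow-go types), a freshly allocated slice of 16-byte view headers in Go is already a well-defined array of empty views: every slot has length 0, buffer index 0 and offset 0, so a null slot never references data that doesn't exist.

    package main

    import "fmt"

    // viewHeader is a hypothetical stand-in for the proposal's 16-byte view:
    // a 4-byte length, then either the string inlined (<= 12 bytes) or a
    // 4-byte prefix plus buffer index and offset for longer strings. Only the
    // long-string form is modeled here.
    type viewHeader struct {
        length    int32
        prefix    [4]byte
        bufferIdx int32
        offset    int32
    }

    func main() {
        // make() zeroes the backing memory, so every slot starts out as a
        // well-defined view: length 0, buffer 0, offset 0 -- an empty string
        // that references no bytes at all.
        views := make([]viewHeader, 4)

        // Populate only slot 1; slots 0, 2 and 3 remain "null" but still valid.
        views[1] = viewHeader{
            length:    5,
            prefix:    [4]byte{'h', 'e', 'l', 'l'},
            bufferIdx: 0,
            offset:    0,
        }

        for i, v := range views {
            fmt.Printf("slot %d: length=%d buffer=%d offset=%d\n", i, v.length, v.bufferIdx, v.offset)
        }
    }

The same holds for validity bitmaps and character buffers allocated through the Go allocator, so no extra initialization pass is needed to uphold the invariant discussed below.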
On Mon, Jul 31, 2023 at 5:18 PM Pedro Eugenio Rocha Pedreira <pedro...@meta.com.invalid> wrote:

> > My only major comment so far is that views for null slots I think should be
> > well defined, that is they shouldn't reference buffers or data that
> > doesn't exist. This is very important for the Rust implementation to be
> > able to provide efficient safe APIs.
>
> Quickly chiming in here again: FWIW, we had a similar discussion for Velox, and
> eventually settled on the invariant that elements should always be copiable,
> in order to make the codebase simpler and less error prone. We keep this
> invariant throughout the engine. This essentially means that Utf8View/BinaryView
> elements are always correctly initialized by zeroing out their lengths, and that
> complex types have sane offsets/lengths values (related to the parallel thread
> on ListViews).
>
> Best,
> --
> Pedro Pedreira
> ________________________________
> From: Raphael Taylor-Davies <r.taylordav...@googlemail.com.INVALID>
> Sent: Monday, July 31, 2023 12:50 AM
> To: dev@arrow.apache.org <dev@arrow.apache.org>
> Subject: Re: [DISCUSS][Format] Draft implementation of string view array format
>
> Hi All,
>
> Having played around with various alternatives, I think we should move
> ahead with this proposal. I think Utf8View and BinaryView would be a
> valuable addition to the arrow specification, and would address many of
> the current pain points encountered when dealing with large string
> payloads or string dictionaries. In particular, they avoid the need to
> copy string data in all bar exceptional circumstances, whilst also
> supporting string interning in a manner that doesn't require recomputing
> dictionaries when combining data streams. Thank you all for the
> insightful discussion.
>
> In that vein I have started implementing them within arrow-rs [1]. My
> only major comment so far is that views for null slots I think should be
> well defined, that is they shouldn't reference buffers or data that
> doesn't exist. This is very important for the Rust implementation to be
> able to provide efficient safe APIs.
>
> Otherwise I don't see any major blockers towards getting these
> standardised, thank you for your continued efforts in this space.
>
> Kind Regards,
>
> Raphael Taylor-Davies
>
> [1]: https://github.com/apache/arrow-rs/pull/4585
>
> On 12/07/2023 17:35, Pedro Eugenio Rocha Pedreira wrote:
> > Hi all, this is Pedro from the Velox team at Meta. Chiming in here to
> > add a bit more context from our side.
> >
> >> I'm not sure the problem here is a lack of understanding or maturity. In
> >> fact, it would be much easier if this was just a problem of education but
> >> it is not.
> >
> > Adding to what Weston said, when Velox started the intent was to have it
> > built on top of vanilla Arrow, if not directly using the Arrow C++ library,
> > at least having an implementation conforming to the data layout. The fact
> > we decided to "extend" the format into what we internally call Velox
> > Vectors was very much a deliberate decision. If Arrow had at the time
> > support for StringView, ListViews (discussed recently in a separate thread,
> > related to supporting out-of-order writes), and more encodings
> > (specifically RLE and Constant, which can now be done through the new REE),
> > Velox would have used it from the beginning.
> > The rationale behind these extensions has been discussed here, but there is
> > more context in our paper last year [0] (check Section 4.2).
> >
> >> I can't speak for all query engines, but at least in the case of
> >> DataFusion we exclusively use the Arrow format as the interchange format
> >> between operators, including for UDFs. We have found that for most
> >> operators operating directly on the Arrow format is sufficiently
> >> performant to not represent a query bottleneck.
> >
> > This always comes down to your workloads and operations you are
> > optimizing for. We found that these extensions were crucial for our
> > workloads, particularly for efficient vectorized expression evaluation.
> > Other modern engines like DuckDB and Umbra had comparable findings and
> > deviate from Arrow in similar ways.
> >
> >> Is Arrow meant to only be used in between systems (in this case query
> >> engines) or is it also meant to be used in between components of a query
> >> engine?
> >
> > Our findings were that if you really care about performance, Arrow today
> > is (unfortunately) not sufficient to support the needs of a
> > state-of-the-art execution engine. Raphael raised a good point above about
> > Arrow potentially favoring interoperability over performance. In that case,
> > I believe Arrow's usage would be restricted to communication between
> > systems, which is about what we see in practice today.
> >
> > However, even between system boundaries this is becoming a hurdle. A
> > recent example is a new project called Gluten [1], which integrates Velox
> > into Spark (akin to Databrick's Photon). The JNI data communication between
> > Velox and Java is done using the Arrow C ABI. We found that the
> > incompatibilities that result in non-zero-copy transfer (back to
> > StringViews) show up high enough as a bottleneck that the team is
> > considering using Velox Vectors as the exchange format to go around this
> > limitation. We argued with the Gluten team to stick with Arrow with the
> > hope we would work with the community to get these extensions incorporated
> > into the standard. Another example is a bridge between Velox and DuckDB
> > built inside Velox that converts these formats directly from one another.
> > We don't use Arrow there for similar reasons. There are other similar
> > examples between PyTorch's pythonic ecosystem and Velox.
> >
> >> I therefore also wonder about the possibility of always having a single
> >> backing buffer that stores the character data, including potentially a copy
> >> of the prefix.
> >
> > There are quite a few use cases for this within Velox and other modern
> > engines, but AFAIK they boil down to two reasons:
> >
> > 1. When you're creating string buffers you usually don't know the size
> > of each string beforehand. This allows us to allocate the buffer
> > page-by-page as you go, then capturing these buffers without having to
> > reallocate and copy them. The simplest use case here is a vectorized
> > function that generates a string as output.
> > 2. It allows us to create StringView arrays without copying the
> > underlying data. Use cases here are the receiving end of an exchange
> > (shuffle), or a simple table scan. In the former, you can capture the pages
> > you directly read from the socket without copying them (we internally use a
> > format called IOBuf from folly, which is sort of a linked list of buffers).
> > In the latter, the pages are usually cached in a memory cache, so you can
> > simply acquire pointers to these page caches. All of that to prevent
> > unnecessary copies.
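To make that zero-copy construction concrete, here is a minimal, self-contained Go sketch (again with hypothetical types, not the arrow-go or Velox APIs) of building a StringView-style array that references externally owned pages as its character buffers; only the 16-byte views themselves are written, none of the page bytes are copied.

    package main

    import "fmt"

    type viewHeader struct {
        length    int32
        prefix    [4]byte
        bufferIdx int32
        offset    int32
    }

    type stringViewArray struct {
        views   []viewHeader
        buffers [][]byte // captured as-is; no bytes are copied into the array
    }

    // appendUncopied records a value that already lives in buffers[buf] at off.
    func (a *stringViewArray) appendUncopied(buf, off, length int) {
        v := viewHeader{length: int32(length), bufferIdx: int32(buf), offset: int32(off)}
        n := length
        if n > 4 {
            n = 4
        }
        copy(v.prefix[:], a.buffers[buf][off:off+n])
        a.views = append(a.views, v)
    }

    func main() {
        // Pretend these pages came straight off a socket or out of a page cache.
        page0 := []byte("foo plus whatever else arrived in this page")
        page1 := []byte("a much longer string living in a second page")

        arr := stringViewArray{buffers: [][]byte{page0, page1}}
        arr.appendUncopied(0, 0, 3)  // "foo"
        arr.appendUncopied(1, 2, 11) // "much longer"

        for _, v := range arr.views {
            data := arr.buffers[v.bufferIdx]
            fmt.Printf("%q\n", data[v.offset:v.offset+v.length])
        }
    }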
> > [0] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
> > [1] - https://github.com/oap-project/gluten
> >
> > Best,
> >
> > --
> > Pedro Pedreira
> > ________________________________
> > From: Weston Pace <weston.p...@gmail.com>
> > Sent: Tuesday, July 11, 2023 8:42 AM
> > To: dev@arrow.apache.org <dev@arrow.apache.org>
> > Subject: Re: [DISCUSS][Format] Draft implementation of string view array format
> >
> >> I definitely hope that with time Arrow will penetrate deeper into these
> >> engines, perhaps in a similar manner to DataFusion, as opposed to
> >> primarily existing at the surface-level.
> >
> > I'm not sure the problem here is a lack of understanding or maturity. In
> > fact, it would be much easier if this was just a problem of education but
> > it is not. Velox is already using Arrow formats (internally, not just at
> > the boundaries) throughout their system. Same with DuckDb. Both teams are
> > very familiar with Arrow. They have just made the decision that the string
> > format is not the correct format to use in a query engine for string data.
> >
> > I don't think there is a single best layout for all situations. I don't
> > think we want to be in the game of declaring which layout is the correct
> > layout for query engines. I suspect there will be cases where the
> > appropriate layout is highly dependent on the workload.
> >
> >> I just wonder if inclusion in the primary
> >> standard is really the right place for them. Perhaps some extension
> >> mechanism might be the way to go here, potentially with some negotiation
> >> mechanism, I'm not really sure
> >> I agree 100% that this sort of interoperability is what makes Arrow so
> >> compelling and something we should work very hard to preserve. This is
> >> the crux of my concern with standardising alternative layouts
> >
> > I did have a proposal for alternative layouts when we were discussing the
> > array view layout. I'll repeat it here for completeness:
> >
> > * There are one or more primary layouts
> >   * Existing layouts are automatically considered primary layouts, even if
> >     they wouldn't have been primary layouts initially (e.g. large list)
> > * A new layout, if it is semantically equivalent to another, is considered
> >   an alternative layout
> > * An alternative layout still has the same requirements for adoption (two
> >   implementations and a vote)
> >   * An implementation should not feel pressured to rush and implement the
> >     new layout. It would be good if they contribute in the discussion and
> >     consider the layout and vote if they feel it would be an acceptable design.
> > * We can define and vote and approve as many canonical alternative > layouts > > as we want: > > * A canonical alternative layout should, at a minimum, have some > > reasonable justification, such as improved performance for > algorithm X > > * Arrow implementations MUST support the primary layouts > > * An Arrow implementation MAY support a canonical alternative, however: > > * An Arrow implementation MUST first support the primary layout > > * An Arrow implementation MUST support conversion to/from the primary > > and canonical layout > > * An Arrow implementation's APIs MUST only provide data in the > > alternative > > layout if it is explicitly asked for (e.g. schema inference should > > prefer the primary layout). > > * We can still vote for new primary layouts (e.g. promoting a canonical > > alternative) but, in these > > votes we don't only consider the value (e.g. performance) of the > layout > > but also the interoperability. > > In other words, a layout can only become a primary layout if there > is > > significant evidence that most > > implementations plan to adopt it. > > > > On Mon, Jul 10, 2023 at 9:49 AM Raphael Taylor-Davies > > <r.taylordav...@googlemail.com.invalid> wrote: > > > >>> For example, if someone (datafusion, velox, etc.) were to come up with > a > >>> framework for UDFs then would batches be passed in and out of those > UDFs > >> in > >>> the Arrow format? > >> Yes, I think the arrow format is a perfect fit for this > >>> Is Arrow meant to only be used in between systems (in this case query > >>> engines) or is it also meant to be used in between components of a > query > >>> engine? > >> I can't speak for all query engines, but at least in the case of > >> DataFusion we exclusively use the Arrow format as the interchange format > >> between operators, including for UDFs. We have found that for most > >> operators operating directly on the Arrow format is sufficiently > >> performant to not represent a query bottleneck. For others, such as > >> joins, sorts and aggregates, we do make use of bespoke data structures > >> and formats internally, e.g. hash tables, row formats, etc..., but the > >> operator's public APIs are still in terms of arrow RecordBatch. We have > >> found this approach to perform very well, whilst also providing very > >> good modularity and composability. > >> > >> In fact we are actually currently in the process of migrating the > >> aggregation logic away from a bespoke mutable row representation to the > >> Arrow model, and are already seeing significant performance > >> improvements, not to mention a significant reduction in code complexity > >> and improved composability [1]. > >> > >>> If every engine has its own bespoke formats internally > >>> then it seems we are placing a limit on how far things can be > decomposed. > >> Agreed, if engines choose to implement operations on bespoke formats, > >> these operations will likely not be as interoperable as those > >> implemented using Arrow. To what extent an engine favours their own > >> format(s) over Arrow will be an engineering trade-off they will have to > >> make, but DataFusion has found exclusively using Arrow as the > >> interchange format between operators to work well. > >> > >>> There are now multiple implementations of a query > >>> engine and I think we are seeing just the edges of this query engine > >>> decomposition (e.g. 
using arrow-c++'s datasets to feed DuckDb or > >> consuming > >>> a velox task as a record batch stream into a different system) and > these > >>> sorts of challenges are in the forefront. > >> I agree 100% that this sort of interoperability is what makes Arrow so > >> compelling and something we should work very hard to preserve. This is > >> the crux of my concern with standardising alternative layouts. I > >> definitely hope that with time Arrow will penetrate deeper into these > >> engines, perhaps in a similar manner to DataFusion, as opposed to > >> primarily existing at the surface-level. > >> > >> [1]: https://github.com/apache/arrow-datafusion/pull/6800 > >> > >> On 10/07/2023 11:38, Weston Pace wrote: > >>>> The point I was trying to make, albeit very badly, was that these > >>>> operations are typically implemented using some sort of row format [1] > >>>> [2], and therefore their performance is not impacted by the array > >>>> representations. I think it is both inevitable, and in fact something > to > >>>> be encouraged, that query engines will implement their own in-memory > >>>> layouts and data structures outside of the arrow specification for > >>>> specific operators, workloads, hardware, etc... This allows them to > make > >>>> trade-offs based on their specific application domain, whilst also > >>>> ensuring that new ideas and approaches can continue to be incorporated > >>>> and adopted in the broader ecosystem. However, to then seek to > >>>> standardise these layouts seems to be both potentially unbounded scope > >>>> creep, and also somewhat counter productive if the goal of > >>>> standardisation is improved interoperability? > >>> FWIW, I believe this formats are very friendly for row representation > as > >>> well, especially when stored as a payload (e.g. in a join). > >>> > >>> For your more general point though I will ask the same question I asked > >> on > >>> the ArrayView discussion: > >>> > >>> Is Arrow meant to only be used in between systems (in this case query > >>> engines) or is it also meant to be used in between components of a > query > >>> engine? > >>> > >>> For example, if someone (datafusion, velox, etc.) were to come up with > a > >>> framework for UDFs then would batches be passed in and out of those > UDFs > >> in > >>> the Arrow format? If every engine has its own bespoke formats > internally > >>> then it seems we are placing a limit on how far things can be > decomposed. > >>> From the C++ perspective, I would personally like to see Arrow be > usable > >>> within components. There are now multiple implementations of a query > >>> engine and I think we are seeing just the edges of this query engine > >>> decomposition (e.g. using arrow-c++'s datasets to feed DuckDb or > >> consuming > >>> a velox task as a record batch stream into a different system) and > these > >>> sorts of challenges are in the forefront. > >>> > >>> On Fri, Jul 7, 2023 at 7:53 AM Raphael Taylor-Davies > >>> <r.taylordav...@googlemail.com.invalid> wrote: > >>> > >>>>> Thus the approach you > >>>>> describe for validating an entire character buffer as UTF-8 then > >> checking > >>>>> offsets will be just as valid for Utf8View arrays as for Utf8 arrays. > >>>> The difference here is that it is perhaps expected for Utf8View to > have > >>>> gaps in the underlying data that are not referenced as part of any > >>>> value, as I had understood this to be one of its benefits over the > >>>> current encoding. 
I think it would therefore be problematic to enforce > >>>> these gaps be UTF-8. > >>>> > >>>>> Furthermore unlike an explicit > >>>>> selection vector a kernel may decide to copy and densify dynamically > if > >>>> it > >>>>> detects that output is getting sparse or fragmented > >>>> I don't see why you couldn't do something similar to materialize a > >>>> sparse selection vector, if anything being able to centralise this > logic > >>>> outside specific kernels would be advantageous. > >>>> > >>>>> Specifically sorting and equality comparison > >>>>> benefit significantly from the prefix comparison fast path, > >>>>> so I'd anticipate that multi column sorting and aggregations would as > >>>> well > >>>> > >>>> The point I was trying to make, albeit very badly, was that these > >>>> operations are typically implemented using some sort of row format [1] > >>>> [2], and therefore their performance is not impacted by the array > >>>> representations. I think it is both inevitable, and in fact something > to > >>>> be encouraged, that query engines will implement their own in-memory > >>>> layouts and data structures outside of the arrow specification for > >>>> specific operators, workloads, hardware, etc... This allows them to > make > >>>> trade-offs based on their specific application domain, whilst also > >>>> ensuring that new ideas and approaches can continue to be incorporated > >>>> and adopted in the broader ecosystem. However, to then seek to > >>>> standardise these layouts seems to be both potentially unbounded scope > >>>> creep, and also somewhat counter productive if the goal of > >>>> standardisation is improved interoperability? I fully expect in the > next > >>>> 5 years someone will come up with an even better way to encode strings > >>>> for some particular workload or hardware, do we then incorporate that > as > >>>> well? > >>>> > >>>> I guess it boils down to what matters to people more, interoperability > >>>> or best-in-class performance? Currently I think it is fair to say both > >>>> arrow and parquet favour interoperability over performance, aiming to > >>>> provide good enough performance broadly on the same order of magnitude > >>>> as a custom solution. I personally think this is the right engineering > >>>> trade-off, but appreciate opinions may differ. Ultimately I just > really > >>>> want arrow to avoid the situation parquet has found itself in, where > the > >>>> specification has both far outstripped the ability for the > >>>> implementations to keep pace, whilst simultaneously having > standardised > >>>> approaches for things like delta encoding that are now considered > >>>> extremely sub-optimal for modern hardware. > >>>> > >>>> That all being said I'm not against adding support for these arrays if > >>>> others are already onboard, I just wonder if inclusion in the primary > >>>> standard is really the right place for them. Perhaps some extension > >>>> mechanism might be the way to go here, potentially with some > negotiation > >>>> mechanism, I'm not really sure... I will continue to think on this > >>>> > >>>> Kind Regards, > >>>> > >>>> Raphael > >>>> > >>>> [1]: > >>>> > >>>> > >> > https://duckdb.org/2021/08/27/external-sorting.html#binary-string-comparison > >>>> [2]: https://docs.rs/arrow-row/latest/arrow_row/ > >>>> > >>>> On 06/07/2023 17:47, Benjamin Kietzman wrote: > >>>>> @Andrew: > >>>>> > >>>>> Restricting these arrays to a single buffer will severely decrease > >> their > >>>>> utility. 
Since the character data is stored in multiple character > >> buffers > >>>>> writing Utf8View array can proceed without resizing allocations, > >>>>> which is a major overhead when writing Utf8 arrays. Furthermore since > >> the > >>>>> character buffers have no restrictions on their size, it's > >>>> straightforward > >>>>> to > >>>>> reuse an existing buffer as a character buffer rather than always > >>>> allocating > >>>>> a new one. In the case of creating an array which shares a lot of > data > >>>> with > >>>>> another (for example, appending some strings) we can reuse most of > the > >>>>> character buffers from the original. Finally Utf8View is well adapted > >> for > >>>>> efficiently wrapping non-arrow string data for ingestion by a kernel, > >>>> even > >>>>> if the string data's full extent is not known ahead of time and is > >> spread > >>>>> across multiple non-contiguous buffers. > >>>>> > >>>>> @Raphael: > >>>>> > >>>>>> branch on access > >>>>> The branch-on-access is unavoidable since a primary feature of the > >>>> Utf8View > >>>>> format is keeping short strings inline in the fixed width portion of > >>>> data. > >>>>> It's worth noting that the inline prefix allows skipping the branch > >>>> entirely > >>>>> for common cases of comparison, for example when the strings to be > >>>> compared > >>>>> differ within the first 4 bytes. > >>>>> > >>>>> In benchmarking (for example while building a hash table) I have not > >>>>> observed > >>>>> that this branch overly pessimizes access. Although I can't guarantee > >>>> every > >>>>> Utf8View array will be more efficient than any Utf8 array, it is > >>>> certainly > >>>>> faster for many relevant cases. Specifically sorting and equality > >>>> comparison > >>>>> benefit significantly from the prefix comparison fast path, > >>>>> so I'd anticipate that multi column sorting and aggregations would as > >>>> well. > >>>>> If there are any other benchmarks which would help to justify > Utf8View > >> in > >>>>> your > >>>>> mind, I'd be happy to try writing them. > >>>>> > >>>>>> UTF-8 validation for StringArray can be done very efficiently by > first > >>>>> verifying the entire buffer, and then verifying the offsets > correspond > >> to > >>>>> the start of a UTF-8 codepoint > >>>>> > >>>>> For non-inlined strings, the character buffers do always contain the > >>>> entire > >>>>> string's data and not just the last `len - 4` bytes. Thus the > approach > >>>> you > >>>>> describe for validating an entire character buffer as UTF-8 then > >> checking > >>>>> offsets will be just as valid for Utf8View arrays as for Utf8 arrays. > >>>>> > >>>>>> it does seem inconsistent to use unsigned types > >>>>> It is indeed more typical for the arrow format to use signed integers > >> for > >>>>> offsets and other quantities. In this case there is prior art in > other > >>>>> engines with which we can remain compatible by using unsigned > integers > >>>>> instead. Since this is only a break with convention within the format > >> and > >>>>> shouldn't be difficult for any implementation to accommodate, I would > >>>> argue > >>>>> that it's worthwhile to avoid pushing change onto existing > >> implementers. > >>>>>> I presume that StringView will behave similarly to dictionaries in > >> that > >>>>> the selection kernels will not recompute the underlying value > buffers. 
> >>>>> > >>>>> The Utf8View format itself is not prescriptive of selection > operations > >> on > >>>>> the > >>>>> array; kernels are free to reuse character buffers (which produces an > >>>>> implicit > >>>>> selection vector) or to recompute them. Furthermore unlike an > explicit > >>>>> selection vector a kernel may decide to copy and densify dynamically > if > >>>> it > >>>>> detects that output is getting sparse or fragmented. It's also worth > >>>> noting > >>>>> that unlike an explicit selection vector a Utf8View array (however > >>>> sparse or > >>>>> fragmented) will still benefit from the prefix comparison fast path. > >>>>> > >>>>> Sincerely, > >>>>> Ben Kietzman > >>>>> > >>>>> On Sun, Jul 2, 2023 at 8:01 AM Raphael Taylor-Davies > >>>>> <r.taylordav...@googlemail.com.invalid> wrote: > >>>>> > >>>>>>> I would be interested in hearing some input from the Rust > community. > >>>>>> A couple of thoughts: > >>>>>> > >>>>>> The variable number of buffers would definitely pose some challenges > >> for > >>>>>> the Rust implementation, the closest thing we currently have is > >> possibly > >>>>>> UnionArray, but even then the number of buffers is still determined > >>>>>> statically by the DataType. I therefore also wonder about the > >>>> possibility > >>>>>> of always having a single backing buffer that stores the character > >> data, > >>>>>> including potentially a copy of the prefix. This would also avoid > >>>> forcing a > >>>>>> branch on access, which I would have expected to hurt performance > for > >>>> some > >>>>>> kernels quite significantly. > >>>>>> > >>>>>> Whilst not really a concern for Rust, which supports unsigned types, > >> it > >>>>>> does seem inconsistent to use unsigned types where the rest of the > >>>> format > >>>>>> encourages the use of signed offsets, etc... > >>>>>> > >>>>>> It isn't clearly specified whether a null should have a valid set of > >>>>>> offsets, etc... I think it is an important property of the current > >> array > >>>>>> layouts that, with exception to dictionaries, the data in null slots > >> is > >>>>>> arbitrary, i.e. can take any value, but not undefined. This allows > for > >>>>>> separate handling of the null mask and values, which can be > important > >>>> for > >>>>>> some kernels and APIs. > >>>>>> > >>>>>> More an observation than an issue, but UTF-8 validation for > >> StringArray > >>>>>> can be done very efficiently by first verifying the entire buffer, > and > >>>> then > >>>>>> verifying the offsets correspond to the start of a UTF-8 codepoint. > >> This > >>>>>> same approach would not be possible for StringView, which would need > >> to > >>>>>> verify individual values and would therefore be significantly more > >>>>>> expensive. As it is UB for a Rust string to contain non-UTF-8 data, > >> this > >>>>>> validation is perhaps more important for Rust than for other > >> languages. > >>>>>> I presume that StringView will behave similarly to dictionaries in > >> that > >>>>>> the selection kernels will not recompute the underlying value > >> buffers. I > >>>>>> think this is fine, but it is perhaps worth noting this has caused > >>>>>> confusion in the past, as people somewhat reasonably expect an array > >>>>>> post-selection to have memory usage reflecting the smaller > selection. > >>>> This > >>>>>> is then especially noticeable if the data is written out to IPC, and > >>>> still > >>>>>> contains data that was supposedly filtered out. 
My 2 cents is that > >>>> explicit > >>>>>> selection vectors are a less surprising way to defer selection than > >>>> baking > >>>>>> it into the array, but I also don't have any workloads where this is > >> the > >>>>>> major bottleneck so can't speak authoritatively here. > >>>>>> > >>>>>> Which leads on to my major concern with this proposal, that it adds > >>>>>> complexity and cognitive load to the specification and > >> implementations, > >>>>>> whilst not meaningfully improving the performance of the operators > >> that > >>>> I > >>>>>> commonly encounter as performance bottlenecks, which are > multi-column > >>>> sorts > >>>>>> and aggregations, or the expensive string operations such as > matching > >> or > >>>>>> parsing. If we didn't already have a string representation I would > be > >>>> more > >>>>>> onboard, but as it stands I'm definitely on the fence, especially > >> given > >>>>>> selection performance can be improved in less intrusive ways using > >>>>>> dictionaries or selection vectors. > >>>>>> > >>>>>> Kind Regards, > >>>>>> > >>>>>> Raphael Taylor-Davies > >>>>>> > >>>>>> On 02/07/2023 11:46, Andrew Lamb wrote: > >>>>>> > >>>>>> * This is the first layout where the number of buffers depends > on > >> the > >>>>>> data > >>>>>> > >>>>>> and not the schema. I think this is the most architecturally > >> significant > >>>>>> fact. I > >>>>>> > >>>>>> I have spent some time reading the initial proposal -- thank you > >> for > >>>>>> that. I now understand what Weston was saying about the "variable > >>>> numbers > >>>>>> of buffers". I wonder if you considered restricting such arrays to a > >>>> single > >>>>>> buffer (so as to make them more similar to other arrow array types > >> that > >>>>>> have a fixed number of buffers)? On Tue, Jun 20, 2023 at 11:33 AM > >> Weston > >>>>>> Pace<weston.p...@gmail.com> <mailto:weston.p...@gmail.com> wrote: > >>>>>> > >>>>>> Before I say anything else I'll say that I am in favor of this new > >>>> layout. > >>>>>> There is some existing literature on the idea (e.g. umbra) and your > >>>>>> benchmarks show some nice improvements. Compared to some of the > other > >>>>>> layouts we've discussed recently (REE, list veiw) I do think this > >>>> layout is > >>>>>> more unique and fundamentally different. Perhaps most fundamentally > >>>>>> different: * This is the first layout where the number of buffers > >>>> depends > >>>>>> on the data and not the schema. I think this is the most > >> architecturally > >>>>>> significant fact. It does require a (backwards compatible) change to > >> the > >>>>>> IPC format itself, beyond just adding new type codes. It also poses > >>>>>> challenges in places where we've assumed there will be at most 3 > >> buffers > >>>>>> (e.g. in ArraySpan, though, as you have shown, we can work around > this > >>>>>> using a raw pointers representation internally in those spots). I > >> think > >>>>>> you've done some great work to integrate this well with Arrow-C++ > and > >>>> I'm > >>>>>> convinced it can work. I would be interested in hearing some input > >> from > >>>> the > >>>>>> Rust community. Ben, at one point there was some discussion that > this > >>>> might > >>>>>> be a c-data only type. However, I believe that was based on the raw > >>>>>> pointers representation. What you've proposed here, if I understand > >>>>>> correctly, is an index + offsets representation and it is suitable > for > >>>> IPC > >>>>>> correct? (e.g. 
I see that you have changes and examples in the IPC > >>>>>> reader/writer) On Mon, Jun 19, 2023 at 7:17 AM Benjamin Kietzman < > >>>>>> bengil...@gmail.com> <mailto:bengil...@gmail.com> wrote: > >>>>>> > >>>>>> Hi Gang, I'm not sure what you mean, sorry if my answers are off > base: > >>>>>> Parquet's ByteArray will be unaffected by the addition of the string > >>>> view > >>>>>> type; all arrow strings > >> (arrow::Type::STRING,arrow::Type::LARGE_STRING, > >>>>>> and with this patcharrow::Type::STRING_VIEW) are converted to > >> ByteArrays > >>>>>> during serialization to parquet [1]. If you mean that encoding of > >>>>>> arrow::Type::STRING_VIEW will not be as fast as encoding of > >> equivalent > >>>>>> arrow::Type::STRING, that's something I haven't benchmarked so I > can't > >>>>>> answer definitively. I would expect it to be > >>>>>> > >>>>>> faster > >>>>>> > >>>>>> than first converting STRING_VIEW->STRING then encoding to parquet; > >>>> direct > >>>>>> encoding avoids allocating and populating temporary buffers. Of > course > >>>>>> > >>>>>> this > >>>>>> > >>>>>> only applies to cases where you need to encode an array of > STRING_VIEW > >>>> to > >>>>>> parquet- encoding of STRING to parquet will be unaffected. > Sincerely, > >>>> Ben > >>>>>> [1] > >>>>>> > >>>>>> > >>>>>> > >> > https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179 > >>>>>> On Thu, Jun 15, 2023 at 10:34 PM Gang Wu<ust...@gmail.com> > >> <mailto: > >>>> ust...@gmail.com> wrote: > >>>>>> Hi Ben, The posted benchmark [1] looks pretty good to me. However, I > >>>> want > >>>>>> to raise a possible issue from the perspective of parquet-cpp. > >>>> Parquet-cpp > >>>>>> uses a customizedparquet::ByteArray type [2] for string/binary, I > >>>>>> > >>>>>> would > >>>>>> > >>>>>> expect some regression of conversions between parquet reader/writer > >> and > >>>>>> the proposed string view array, especially when some strings use > short > >>>> form > >>>>>> and others use long form. [1] > >>>>>> > >>>>>> > >>>>>> > >> > https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575 > >>>>>> [2] > https://github.com/apache/arrow/pull/35628#issuecomment-1583218617 > >>>>>> Best, Gang On Fri, Jun 16, 2023 at 3:58 AM Will Jones < > >>>>>> will.jones...@gmail.com> <mailto:will.jones...@gmail.com> wrote: > >>>>>> > >>>>>> Cool. Thanks for doing that! On Thu, Jun 15, 2023 at 12:40 Benjamin > >>>>>> Kietzman <bengil...@gmail.com <mailto:bengil...@gmail.com> > >>>>>> > >>>>>> wrote: > >>>>>> > >>>>>> I've addedhttps://github.com/apache/arrow/issues/36112 to track > >>>>>> deduplication of buffers on write. I don't think it would require > >>>>>> modification of the IPC format. Ben On Thu, Jun 15, 2023 at 1:30 PM > >> Matt > >>>>>> Topol <zotthewiz...@gmail.com <mailto:zotthewiz...@gmail.com> > >>>>>> > >>>>>> wrote: > >>>>>> > >>>>>> Based on my understanding, in theory a buffer *could* be shared > >>>>>> > >>>>>> within > >>>>>> > >>>>>> a > >>>>>> > >>>>>> batch since the flatbuffers message just uses an offset and > >>>>>> > >>>>>> length > >>>>>> > >>>>>> to > >>>>>> > >>>>>> identify the buffers. That said, I don't believe any current > >>>>>> implementation actually > >>>>>> > >>>>>> does > >>>>>> > >>>>>> this > >>>>>> > >>>>>> or > >>>>>> > >>>>>> takes advantage of this in any meaningful way. 
--Matt On Thu, Jun > 15, > >>>> 2023 > >>>>>> at 1:00 PM Will Jones < > >>>>>> > >>>>>> will.jones...@gmail.com <mailto:will.jones...@gmail.com>> > >>>>>> > >>>>>> wrote: > >>>>>> > >>>>>> Hi Ben, It's exciting to see this move along. The buffers will be > >>>>>> duplicated. If buffer duplication is > >>>>>> > >>>>>> becomes > >>>>>> > >>>>>> a > >>>>>> > >>>>>> concern, > >>>>>> > >>>>>> I'd prefer to handle that in the ipc writer. Then buffers which are > >>>>>> duplicated > >>>>>> > >>>>>> could > >>>>>> > >>>>>> be > >>>>>> > >>>>>> detected > >>>>>> > >>>>>> by checking pointer identity and written only once. > >>>>>> > >>>>>> Question: to be able to write buffer only once and reference in > >>>>>> > >>>>>> multiple > >>>>>> > >>>>>> arrays, does that require a change to the IPC format? Or is > >>>>>> > >>>>>> sharing > >>>>>> > >>>>>> buffers > >>>>>> > >>>>>> within the same batch already allowed in the IPC format? Best, Will > >>>> Jones > >>>>>> On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman < > >>>>>> > >>>>>> bengil...@gmail.com <mailto:bengil...@gmail.com> > >>>>>> > >>>>>> wrote: > >>>>>> > >>>>>> Hello again all, The PR [1] to add string view to the format and the > >> C++ > >>>>>> implementation > >>>>>> > >>>>>> is > >>>>>> > >>>>>> hovering around passing CI and has been undrafted. > >>>>>> > >>>>>> Furthermore, > >>>>>> > >>>>>> there > >>>>>> > >>>>>> is > >>>>>> > >>>>>> now also a PR [2] to add string view to the Go > >>>>>> > >>>>>> implementation. > >>>>>> > >>>>>> Code > >>>>>> > >>>>>> review > >>>>>> > >>>>>> is underway for each PR and I'd like to move toward a vote > >>>>>> > >>>>>> for > >>>>>> > >>>>>> acceptance- > >>>>>> > >>>>>> are there any other preliminaries which I've neglected? To reiterate > >> the > >>>>>> answers to some past questions: - Benchmarks are added in the C++ PR > >>>> [1] to > >>>>>> demonstrate the > >>>>>> > >>>>>> performance > >>>>>> > >>>>>> of > >>>>>> > >>>>>> conversion between the various string formats. In addition, > >>>>>> > >>>>>> there > >>>>>> > >>>>>> are > >>>>>> > >>>>>> some benchmarks which demonstrate the performance gains > >>>>>> > >>>>>> available > >>>>>> > >>>>>> with > >>>>>> > >>>>>> the new format [3]. - Adding string view to the C ABI is a > natural > >>>> follow > >>>>>> up, but > >>>>>> > >>>>>> should > >>>>>> > >>>>>> be > >>>>>> > >>>>>> handled independently. An issue has been added to track > >>>>>> > >>>>>> that > >>>>>> > >>>>>> enhancement [4]. Sincerely, Ben Kietzman [1] > >>>>>> https://github.com/apache/arrow/pull/35628 [2] > >>>>>> https://github.com/apache/arrow/pull/35769 [3] > >>>>>> > >>>>>> https://github.com/apache/arrow/pull/35628#issuecomment-1583218617 > >>>>>> > >>>>>> [4]https://github.com/apache/arrow/issues/36099 On Wed, May 17, > 2023 > >>>> at > >>>>>> 12:53 PM Benjamin Kietzman < > >>>>>> > >>>>>> bengil...@gmail.com <mailto:bengil...@gmail.com>> > >>>>>> > >>>>>> wrote: > >>>>>> > >>>>>> @Jacob > >>>>>> > >>>>>> You mention benchmarks multiple times, are these results > >>>>>> > >>>>>> published > >>>>>> > >>>>>> somewhere? I benchmarked the performance of raw pointer vs index > >>>>>> > >>>>>> offset > >>>>>> > >>>>>> views > >>>>>> > >>>>>> in > >>>>>> > >>>>>> my > >>>>>> > >>>>>> PR to velox, I do intend to port them to my arrow PR but I haven't > >>>>>> > >>>>>> gotten > >>>>>> > >>>>>> there > >>>>>> > >>>>>> yet. 
> >>>>>> > >>>>>> Furthermore, it seemed less urgent to me since coexistence of the > two > >>>>>> types > >>>>>> > >>>>>> in > >>>>>> > >>>>>> the > >>>>>> > >>>>>> c++ > >>>>>> > >>>>>> implementation defers the question of how aggressively one should be > >>>>>> > >>>>>> preferred > >>>>>> > >>>>>> over > >>>>>> > >>>>>> the > >>>>>> > >>>>>> other. @Dewey > >>>>>> > >>>>>> I don't see the C Data interface in the PR > >>>>>> > >>>>>> I have not addressed the C ABI in this PR. As you mention, > >>>>>> > >>>>>> it > >>>>>> > >>>>>> may > >>>>>> > >>>>>> be > >>>>>> > >>>>>> useful to transmit arrays with raw pointer views between > >> implementations > >>>>>> which > >>>>>> > >>>>>> allow > >>>>>> > >>>>>> them. I > >>>>>> > >>>>>> can address this in a follow up PR. @Will > >>>>>> > >>>>>> If I understand correctly, multiple arrays can reference > >>>>>> > >>>>>> the > >>>>>> > >>>>>> same > >>>>>> > >>>>>> buffers > >>>>>> > >>>>>> in memory, but once they are written to IPC their data > >