Ultimately, the problem comes down to providing O(# records) time and memory use for selections (take, filter) on non-numeric data (strings, arrays, maps, etc.).
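To make that cost concrete, here is a minimal sketch (not taken from the Arrow C++ sources; the type and function names are illustrative) of what a take kernel has to do against the current variable-size string layout, i.e. an int32 offsets buffer plus a contiguous data buffer:

    #include <cstdint>
    #include <vector>

    // Current variable-size layout: offsets[i]..offsets[i+1] delimit
    // value i in the contiguous `data` buffer.
    struct StringColumn {
      std::vector<int32_t> offsets;  // length = number of values + 1
      std::vector<char> data;        // concatenated string bytes
    };

    // Take has to rebuild *both* buffers: O(# records) work for the new
    // offsets plus a copy of every selected byte, so for non-trivial
    // strings the cost is dominated by the size of the data buffer,
    // not by the record count.
    StringColumn TakeStrings(const StringColumn& in,
                             const std::vector<int64_t>& indices) {
      StringColumn out;
      out.offsets.reserve(indices.size() + 1);
      out.offsets.push_back(0);
      for (int64_t i : indices) {
        int32_t begin = in.offsets[i];
        int32_t end = in.offsets[i + 1];
        out.data.insert(out.data.end(), in.data.begin() + begin,
                        in.data.begin() + end);
        out.offsets.push_back(static_cast<int32_t>(out.data.size()));
      }
      return out;
    }

Every selected value's bytes are copied and a new offsets buffer is built, which is why the cost scales with the size of the data buffer rather than with the number of records.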
DuckDB and Velox are two projects that have designed themselves to be very nearly Arrow-compatible but have implemented alternative memory layouts to achieve O(# records) selections on all data types. I am proposing to adopt these innovations as additional memory layouts in Arrow with a target of zero-copy across the C ABI — how exactly they are translated to the IPC format seems a less immediate benefit than enabling the in-memory performance and memory-use optimizations, since query engines can accelerate performance with faster selections. If there are alternative proposals to achieve O(# records) time and space complexity for selection operations, let's definitely look at them.

On Tue, Dec 14, 2021 at 8:02 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> Would it be simpler to change the spec so that child arrays can be chunked? This might reduce the data type growth and make the intent clearer.
>
> This will add another dimension to performance analysis. We pretty regularly get issues/tickets from users that have unknowingly created Parquet files with poor row group resolution (e.g. 50 rows per row group) and experience rotten performance as a result. I suspect something similar could happen here. It sounds like arrays will naturally subdivide over time. Users might start seeing poor performance without realizing the root cause is that their 1 million element array has been split into 10,000 allocations of 100 elements. However, I suspect this is something that could be managed with visibility and recompaction utilities.
>
>
> On Tue, Dec 14, 2021 at 1:22 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > hi folks,
> >
> > A few things in the general discussion, before certain things will have to be split off into their own dedicated discussions.
> >
> > It seems that I didn't do a very good job of motivating the "sequence view" type. Let me take a step back and discuss one of the problems these new memory layouts are solving.
> >
> > In Arrow currently, selection operations ("take", "filter", or indirect sort — the equivalent of arr.take(argsort(something_else)) if you're coming from NumPy) have time complexity proportional to the number of records for primitive types and proportional to max(# records, memory size) for nested types.
> >
> > So, for example:
> >
> > * Take(arr, indices) has O(# records) complexity for primitive types and does O(# records) memory allocation
> > * Take(arr, indices) has O(max(# records, size of memory buffers / child arrays)) complexity for strings and nested types and does O(size of memory buffers) memory allocation
> >
> > This means that columnar query engines that leverage selections can incur heavy costs in both time and memory use when doing selections on non-primitive array data. Selections may arise from filtering, sorting, or other operations.
> >
> > The "String view" and "Sequence view" memory layouts in this document do not have this problem. When using them for strings and nested data, they have the same time complexity and memory allocation behavior for selections as primitive types, and the "child" memory buffers do not have to be manipulated or rebuilt at all. This yields significant performance benefits and reduced memory use.
> >
> > Additionally, the string view and sequence view layouts solve the problem of out-of-order construction.
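For concreteness, here is a minimal sketch of the kind of fixed 16-byte view struct that DuckDB and Velox use; the exact field layout below is illustrative rather than a spec proposal. The point is that a selection only relocates fixed-size structs and never touches the underlying character buffers:

    #include <cstdint>
    #include <vector>

    // Where a long string lives: an index into a set of shared data
    // buffers plus an offset within that buffer (a raw pointer is the
    // alternative discussed elsewhere in this thread).
    struct LongStringRef {
      uint32_t buffer_index;
      uint32_t offset;
    };

    // 16-byte view: strings of <= 12 bytes are stored entirely inline;
    // longer strings keep a 4-byte prefix plus a reference to their bytes.
    struct StringView {
      uint32_t size;
      char prefix[4];       // first 4 bytes, enables cheap "not equal" checks
      union {
        char inlined[8];    // remainder of a short string
        LongStringRef ref;  // location of a long string's bytes
      };
    };
    static_assert(sizeof(StringView) == 16, "views are fixed size");

    // Selection is now identical in cost to the primitive-type case:
    // O(# records) time, O(# records) allocation, and the character
    // buffers are never touched or rebuilt.
    std::vector<StringView> TakeViews(const std::vector<StringView>& in,
                                      const std::vector<int64_t>& indices) {
      std::vector<StringView> out;
      out.reserve(indices.size());
      for (int64_t i : indices) {
        out.push_back(in[i]);
      }
      return out;
    }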
> > As has been pointed out, one way to work around this issue at present is to use "chunked arrays". However, this means that you cannot ever use thread parallelism in the construction of non-chunked outputs with nested data (for example, in expression evaluation) — if a nested array forms part of a record batch, then either you must stick to single-threaded execution or use thread parallelism to subdivide even the non-nested fields of the record batch so as to obtain equal-sized arrays across all fields. For example, if you had a record batch with 32K rows and wanted to parallelize execution of a projection using 4 threads, you would need to divide all fields into chunks of 8K each prior to beginning to produce outputs. This is fairly inflexible.
> >
> > As another motivating example, consider a parallel selection operation (e.g. "take" or "filter") on a nested array. Currently it is not possible to parallelize at all because of the in-order construction requirement.
> >
> > I don't expect you to just trust me — here is an example:
> >
> > https://gist.github.com/wesm/25fc7b877f913c7e4449117178302646
> >
> > In this example, I use Take to permute 1M doubles and 1M strings of 50 bytes each:
> >
> > * Doubles: 2.45ms (new memory allocated: 8000000)
> > * Strings: 39.6ms (new memory allocated: 54000000)
> >
> > The performance ratio is 16x even though the memory ratio is only ~7x. With the "StringView" data type, only 16000000 bytes of new memory would need to be allocated, and the performance should be only 2-4x slower than the doubles case (because we only need to relocate a bunch of 16-byte structs) instead of 16x slower.
> >
> > I hope you can see now that this can be a rather serious resource utilization issue, both in processing time and memory use. I will update the document to explain this better and work on responding to some of the other comments.
> >
> > Wes
> >
> > On Tue, Dec 14, 2021 at 5:08 AM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > >
> > > Hello,
> > >
> > > I think my main concern is how we can prevent the community from fragmenting too much over supported encodings. The more complex the encodings, the less likely they are to be supported by all main implementations. We see this in Parquet, where the efficient "delta" encodings have only just received support in Parquet C++, and even then only on the read side.
> > >
> > > There is an additional subtlety in that Arrow is not a storage mechanism but an in-memory representation, so the pieces doing computation have to be adapted to the new encodings, for example the entire library of computation kernels in Arrow C++ (of course, an easy but inefficient adaptation is to always unpack to an already supported layout).
> > >
> > > As an anecdote, the Arrow C++ kernels are supposed to accept a selection vector to filter their physical inputs, but none actually supports it. I think we should be wary of adding ambitious new features that might never get an actual implementation.
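To spell out what accepting a selection vector means in practice, here is a schematic sketch; this is not the Arrow C++ kernel API, and the names are invented for illustration. The kernel consumes the positions that survived a filter instead of a materialized, filtered copy of its input:

    #include <cstdint>
    #include <vector>

    // A selection vector lists the row positions that survived a filter.
    using SelectionVector = std::vector<int32_t>;

    // A kernel that honors the selection reads only the selected rows of
    // its physical input; no filtered copy of `values` is materialized.
    int64_t SumSelected(const std::vector<int64_t>& values,
                        const SelectionVector& selection) {
      int64_t sum = 0;
      for (int32_t row : selection) {
        sum += values[row];
      }
      return sum;
    }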
> > >
> > > On the detail of the proposed encodings:
> > >
> > > - I hope we can avoid storing raw pointers instead of offsets into a separate buffer; I understand the flexibility argument for pointers, but it will also make data transfer more complicated
> > >
> > > - Constant arrays are a special case of RLE arrays and I'm not sure doing both is really useful
> > >
> > > - I don't really understand the concrete use case for the weird "sequence view" layout; I'll note that non-monotonic offsets can make linear traversal less efficient, since the CPU won't automatically prefetch data for you
> > >
> > > - The proposed RLE encoding seems inefficient; usually, RLE encodings try hard to minimize the size overhead of RLE sequences, such that they become beneficial even for very short repeated runs
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On 10/12/2021 at 20:28, Wes McKinney wrote:
> > > >
> > > > This topic may provoke some debate, but, given that Arrow is approaching its 6-year anniversary, I think this is an important discussion about how we can thoughtfully expand the Arrow specifications to support next-generation columnar data processing. I have been motivated by recent interactions with CWI's DuckDB and Meta's Velox open source projects and the innovations they have made around data representation, which provide beneficial features above and beyond what we already have in Arrow. For example, they have a 16-byte "string view" data type that enables buffer memory reuse, faster "false" comparisons on strings unequal in the first 4 bytes, and inlining of small strings. Both the Rust and C++ query engine efforts could potentially benefit from this (I am not sure about the memory safety implications in Rust; comments on this would be helpful).
> > > >
> > > > I wrote a document to start a discussion about a few new ways to represent data that may help with building Arrow-native/Arrow-compatible query engines:
> > > >
> > > > https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#
> > > >
> > > > Each of these potential additions would need to be eventually split off into independent efforts with associated additions to the columnar specification, IPC format, C ABI, integration tests, and so on.
> > > >
> > > > The document is open to anyone to comment, but if anyone would like edit access, please feel free to request it. I look forward to the discussion.
> > > >
> > > > Thanks,
> > > > Wes
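For concreteness, here is a sketch of the "false comparison" fast path that the 16-byte string view described above enables, using the raw-pointer variant Antoine mentions; the struct layout and names are illustrative only:

    #include <cstdint>
    #include <cstring>

    // Illustrative 16-byte view using a raw pointer to the string bytes.
    struct StringView {
      uint32_t size;
      char prefix[4];    // first 4 bytes of the string, zero-padded
      const char* data;  // all of the string's bytes
    };

    // Equality can often be decided without dereferencing `data`: a
    // length mismatch or a differing 4-byte prefix answers "not equal"
    // immediately.
    bool ViewsEqual(const StringView& a, const StringView& b) {
      if (a.size != b.size) return false;
      if (std::memcmp(a.prefix, b.prefix, sizeof(a.prefix)) != 0) return false;
      return std::memcmp(a.data, b.data, a.size) == 0;
    }

Because the length and 4-byte prefix live in the view itself, most unequal pairs are rejected without ever reading the out-of-line string bytes.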