Ultimately, the problem comes down to providing O(# records) time and memory use for selections (take, filter) on non-numeric data (strings, arrays, maps, etc.).
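To make that cost concrete, here is a minimal sketch (not taken from the Arrow C++ sources; the type and function names are illustrative) of what a take kernel has to do against the current variable-size string layout, i.e. an int32 offsets buffer plus a contiguous data buffer:

    #include <cstdint>
    #include <vector>

    // Current variable-size layout: offsets[i]..offsets[i+1] delimit
    // value i in the contiguous `data` buffer.
    struct StringColumn {
      std::vector<int32_t> offsets;  // length = number of values + 1
      std::vector<char> data;        // concatenated string bytes
    };

    // Take has to rebuild *both* buffers: O(# records) work for the new
    // offsets plus a copy of every selected byte, so for non-trivial
    // strings the cost is dominated by the size of the data buffer,
    // not by the record count.
    StringColumn TakeStrings(const StringColumn& in,
                             const std::vector<int64_t>& indices) {
      StringColumn out;
      out.offsets.reserve(indices.size() + 1);
      out.offsets.push_back(0);
      for (int64_t i : indices) {
        int32_t begin = in.offsets[i];
        int32_t end = in.offsets[i + 1];
        out.data.insert(out.data.end(), in.data.begin() + begin,
                        in.data.begin() + end);
        out.offsets.push_back(static_cast<int32_t>(out.data.size()));
      }
      return out;
    }

Every selected value's bytes are copied and a new offsets buffer is built, which is why the cost scales with the size of the data buffer rather than with the number of records.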
DuckDB and Velox are two projects that have designed themselves to be very nearly Arrow-compatible but have implemented alternative memory layouts to achieve O(# records) selections on all data types. I am proposing to adopt these innovations as additional memory layouts in Arrow with a target of zero-copy across the C ABI — how exactly they are translated to the IPC format seems a less immediate benefit than enabling the in-memory performance and memory-use optimizations, since query engines can accelerate performance with faster selections. If there are alternative proposals to achieve O(# records) time and space complexity for selection operations, let's definitely look at them.

On Tue, Dec 14, 2021 at 8:02 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> Would it be simpler to change the spec so that child arrays can be chunked? This might reduce the data type growth and make the intent clearer.
>
> This will add another dimension to performance analysis. We pretty regularly get issues/tickets from users that have unknowingly created Parquet files with poor row group resolution (e.g. 50 rows per row group) and experience rotten performance as a result. I suspect something similar could happen here. It sounds like arrays will naturally subdivide over time. Users might start seeing poor performance without realizing the root cause is that their 1 million element array has been split into 10,000 allocations of 100 elements. However, I suspect this is something that could be managed with visibility and recompaction utilities.
>
>
> On Tue, Dec 14, 2021 at 1:22 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > hi folks,
> >
> > A few things in the general discussion, before certain things will have to be split off into their own dedicated discussions.
> >
> > It seems that I didn't do a very good job of motivating the "sequence view" type. Let me take a step back and discuss one of the problems these new memory layouts are solving.
> >
> > In Arrow currently, selection operations ("take", "filter", or indirect sort — the equivalent of arr.take(argsort(something_else)) if you're coming from NumPy) have time complexity proportional to the number of records for primitive types and proportional to max(# records, memory size) for nested types.
> >
> > So, for example:
> >
> > * Take(arr, indices) has O(# records) complexity for primitive types and does O(# records) memory allocation
> > * Take(arr, indices) has O(max(# records, size of memory buffers / child arrays)) complexity for strings and nested types and does O(size of memory buffers) memory allocation
> >
> > This means that columnar query engines that leverage selections can incur heavy costs in both time and memory use when doing selections on non-primitive array data. Selections may arise from filtering, sorting, or other operations.
> >
> > The "String view" and "Sequence view" memory layouts in this document do not have this problem. When using them for strings and nested data, they have the same time complexity and memory allocation behavior for selections as primitive types, and the "child" memory buffers do not have to be manipulated or rebuilt at all. This yields significant performance benefits and reduced memory use.
> >
> > Additionally, the string view and sequence view layouts solve the problem of out-of-order construction.
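For concreteness, here is a minimal sketch of the kind of fixed 16-byte view struct that DuckDB and Velox use; the exact field layout below is illustrative rather than a spec proposal. The point is that a selection only relocates fixed-size structs and never touches the underlying character buffers:

    #include <cstdint>
    #include <vector>

    // Where a long string lives: an index into a set of shared data
    // buffers plus an offset within that buffer (a raw pointer is the
    // alternative discussed elsewhere in this thread).
    struct LongStringRef {
      uint32_t buffer_index;
      uint32_t offset;
    };

    // 16-byte view: strings of <= 12 bytes are stored entirely inline;
    // longer strings keep a 4-byte prefix plus a reference to their bytes.
    struct StringView {
      uint32_t size;
      char prefix[4];       // first 4 bytes, enables cheap "not equal" checks
      union {
        char inlined[8];    // remainder of a short string
        LongStringRef ref;  // location of a long string's bytes
      };
    };
    static_assert(sizeof(StringView) == 16, "views are fixed size");

    // Selection is now identical in cost to the primitive-type case:
    // O(# records) time, O(# records) allocation, and the character
    // buffers are never touched or rebuilt.
    std::vector<StringView> TakeViews(const std::vector<StringView>& in,
                                      const std::vector<int64_t>& indices) {
      std::vector<StringView> out;
      out.reserve(indices.size());
      for (int64_t i : indices) {
        out.push_back(in[i]);
      }
      return out;
    }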
> > As has been pointed out, one way to work around this issue at present is to use "chunked arrays". However, this means that you cannot ever use thread parallelism in the construction of non-chunked outputs with nested data (for example, in expression evaluation) — if a nested array forms part of a record batch, then either you must stick to single-threaded execution or use thread parallelism to subdivide even the non-nested fields of the record batch so as to obtain equal-sized arrays across all fields. For example, if you had a record batch with 32K rows and wanted to parallelize execution of a projection using 4 threads, you would need to divide all fields into chunks of 8K each prior to beginning to produce outputs. This is fairly inflexible.
> >
> > As another motivating example, consider a parallel selection operation (e.g. "take" or "filter") on a nested array. Currently it is not possible to parallelize at all because of the in-order construction requirement.
> >
> > I don't expect you to just trust me — here is an example:
> >
> > https://gist.github.com/wesm/25fc7b877f913c7e4449117178302646
> >
> > In this example, I use Take to permute 1M doubles and 1M strings of 50 bytes each:
> >
> > * Doubles: 2.45ms (new memory allocated: 8000000)
> > * Strings: 39.6ms (new memory allocated: 54000000)
> >
> > The performance ratio is 16x even though the memory ratio is only ~7x. With the "StringView" data type, only 16000000 bytes of new memory would need to be allocated, and the performance should be only 2-4x slower than the doubles case (because we only need to relocate a bunch of 16-byte structs) instead of 16x slower.
> >
> > I hope you can see now that this can be a rather serious resource utilization issue, both in processing time and memory use. I will update the document to explain this better and work on responding to some of the other comments.
> >
> > Wes
> >
> > On Tue, Dec 14, 2021 at 5:08 AM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > >
> > > Hello,
> > >
> > > I think my main concern is how we can prevent the community from fragmenting too much over supported encodings. The more complex the encodings, the less likely they are to be supported by all main implementations. We see this in Parquet, where the efficient "delta" encodings have only just received support in Parquet C++, and even then only on the read side.
> > >
> > > There is an additional subtlety in that Arrow is not a storage mechanism but an in-memory representation, so the pieces doing computation have to be adapted to the new encodings, for example the entire library of computation kernels in Arrow C++ (of course, an easy but inefficient adaptation is to always unpack to an already supported layout).
> > >
> > > As an anecdote, the Arrow C++ kernels are supposed to accept a selection vector to filter their physical inputs, but none actually supports it. I think we should be wary of adding ambitious new features that might never get an actual implementation.
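To spell out what accepting a selection vector means in practice, here is a schematic sketch; this is not the Arrow C++ kernel API, and the names are invented for illustration. The kernel consumes the positions that survived a filter instead of a materialized, filtered copy of its input:

    #include <cstdint>
    #include <vector>

    // A selection vector lists the row positions that survived a filter.
    using SelectionVector = std::vector<int32_t>;

    // A kernel that honors the selection reads only the selected rows of
    // its physical input; no filtered copy of `values` is materialized.
    int64_t SumSelected(const std::vector<int64_t>& values,
                        const SelectionVector& selection) {
      int64_t sum = 0;
      for (int32_t row : selection) {
        sum += values[row];
      }
      return sum;
    }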
> > >
> > > On the detail of the proposed encodings:
> > >
> > > - I hope we can avoid storing raw pointers instead of offsets into a separate buffer; I understand the flexibility argument for pointers, but it will also make data transfer more complicated
> > >
> > > - Constant arrays are a special case of RLE arrays and I'm not sure doing both is really useful
> > >
> > > - I don't really understand the concrete use case for the weird "sequence view" layout; I'll note that non-monotonic offsets can make linear traversal less efficient, since the CPU won't automatically prefetch data for you
> > >
> > > - The proposed RLE encoding seems inefficient; usually, RLE encodings try hard to minimize the size overhead of RLE sequences, such that they become beneficial even for very short repeated runs
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On 10/12/2021 at 20:28, Wes McKinney wrote:
> > > >
> > > > This topic may provoke some debate, but, given that Arrow is approaching its 6-year anniversary, I think this is an important discussion about how we can thoughtfully expand the Arrow specifications to support next-generation columnar data processing. I have been motivated by recent interactions with CWI's DuckDB and Meta's Velox open source projects and the innovations they have made around data representation, which provide beneficial features above and beyond what we already have in Arrow. For example, they have a 16-byte "string view" data type that enables buffer memory reuse, faster "false" comparisons on strings unequal in the first 4 bytes, and inlining of small strings. Both the Rust and C++ query engine efforts could potentially benefit from this (I am not sure about the memory safety implications in Rust; comments on this would be helpful).
> > > >
> > > > I wrote a document to start a discussion about a few new ways to represent data that may help with building Arrow-native/Arrow-compatible query engines:
> > > >
> > > > https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#
> > > >
> > > > Each of these potential additions would need to be eventually split off into independent efforts with associated additions to the columnar specification, IPC format, C ABI, integration tests, and so on.
> > > >
> > > > The document is open to anyone to comment, but if anyone would like edit access, please feel free to request it. I look forward to the discussion.
> > > >
> > > > Thanks,
> > > > Wes
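For concreteness, here is a sketch of the "false comparison" fast path that the 16-byte string view described above enables, using the raw-pointer variant Antoine mentions; the struct layout and names are illustrative only:

    #include <cstdint>
    #include <cstring>

    // Illustrative 16-byte view using a raw pointer to the string bytes.
    struct StringView {
      uint32_t size;
      char prefix[4];    // first 4 bytes of the string, zero-padded
      const char* data;  // all of the string's bytes
    };

    // Equality can often be decided without dereferencing `data`: a
    // length mismatch or a differing 4-byte prefix answers "not equal"
    // immediately.
    bool ViewsEqual(const StringView& a, const StringView& b) {
      if (a.size != b.size) return false;
      if (std::memcmp(a.prefix, b.prefix, sizeof(a.prefix)) != 0) return false;
      return std::memcmp(a.data, b.data, a.size) == 0;
    }

Because the length and 4-byte prefix live in the view itself, most unequal pairs are rejected without ever reading the out-of-line string bytes.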