On Fri, Jun 26, 2020 at 8:56 AM Chris Osborn <csosb...@fb.com.invalid> wrote:
>
> Yes, it would be quite feasible to preallocate a region large enough for
> several thousand rows for each column, assuming I read from that region
> while it's still filling in. When that region is full, I could either
> allocate a new big chunk or loop around if I no longer need the data. I'm
> now doing something like that in a revised prototype. Specifically I'm
> creating builders and calling Reserve() once up front to get a large
> region, which I then fill in with multiple batches. As the producer fills
> it in using ArrayBuilder::Append(), the consumers read out earlier rows
> using ArrayBuilder::GetValue(). This works, but I'm clearly going against
> the spirit of the library by using builders as ersatz Arrays and a set of
> builders in lieu of a Table.
>
> In short, it's feasible (and preferable) to preallocate the memory needed,
> whether it's the builders' memory or the RecordBatch/Table's memory
> (ideally that's the same thing?). I just haven't been able to figure out
> how to do that gracefully.
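
For reference, the "builders as ersatz Arrays" workaround described above
might look roughly like the sketch below. This is only an illustration
(single Int64 column, default memory pool; RunExample is an invented name,
not an Arrow API):

#include <arrow/api.h>
#include <iostream>

arrow::Status RunExample() {
  arrow::Int64Builder builder;
  // Reserve one big region up front instead of growing per batch.
  ARROW_RETURN_NOT_OK(builder.Reserve(4096));

  // Producer side: append rows in small batches.
  for (int64_t i = 0; i < 10; ++i) {
    ARROW_RETURN_NOT_OK(builder.Append(i * 100));
  }

  // Consumer side: read earlier rows back out of the builder without ever
  // calling Finish() -- the builder is standing in for an Array here.
  for (int64_t i = 0; i < builder.length(); ++i) {
    std::cout << builder.GetValue(i) << " ";
  }
  std::cout << std::endl;
  return arrow::Status::OK();
}

int main() {
  arrow::Status st = RunExample();
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;
}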
By following the columnar format's buffer layouts [1] it should be
straightforward to compute the size of a memory region to preallocate that
represents a RecordBatch's memory, then construct the Buffer and ArrayData
objects that reference each constituent buffer, and finally create a
RecordBatch from those ArrayData objects. Some assumptions must be made, of
course:

* If a field is nullable, then an empty validity bitmap must be preallocated
  (and you can initialize it to all valid or all null based on what your
  application prefers).

* You must decide what to do about variable-size allocations for
  binary/string types (and, extrapolating, analogously for list types if you
  have List-like data). So if you preallocate a region that can accommodate
  1024 values, you might allocate 32KB data buffers for string data (or some
  factor of the length if you have bigger strings). If you fill up the data
  buffer then you will have to move on to the next region. Another approach
  might be to let the string data buffer be a separate ResizableBuffer that
  you reallocate when you need to make it bigger.

I could envision creating a C++ implementation to manage this whole process
that becomes a part of the Arrow C++ codebase -- preallocate memory given
some global / field-level options and then provide effectively "UnsafeAppend"
APIs to write data into the preallocated region. If you create a "parent"
RecordBatch that references the preallocated memory, then you can use
`RecordBatch::Slice` to "chop" off the filled portion to pass to your
consumer (a rough sketch of this is appended at the bottom of this message).

[1]: https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#buffer-listing-for-each-layout

> Thanks!
> Chris Osborn
>
> ________________________________
> From: Wes McKinney <wesmck...@gmail.com>
> Sent: Thursday, June 25, 2020 10:13 PM
> To: dev <dev@arrow.apache.org>
> Subject: Re: Arrow for low-latency streaming of small batches?
>
> Is it feasible to preallocate the memory region where you are writing
> the record batch?
>
> On Thu, Jun 25, 2020 at 1:06 PM Chris Osborn <csosb...@gmail.com> wrote:
> >
> > Hi,
> >
> > I am investigating Arrow for a project that needs to transfer records
> > from a producer to one or more consumers in small batches (median batch
> > size is 1) and with low latency. The usual structure for something like
> > this would be a single producer multi-consumer queue*. Is there any sane
> > way to use Arrow in this fashion? I have a little C++ prototype that
> > works, but it does the following for each batch of rows:
> >
> > Producer side:
> > 1. construct a set of builders
> > 2. append a value to each builder for each record in the batch
> > 3. finish the builders and use them to make a RecordBatch
> > 4. append the RecordBatch to a vector
> >
> > Consumer side:
> > 1. construct a Table from the vector of RecordBatches
> > 2. slice out the part of the table that the consumer requires (each
> > consumer keeps its own offset)
> > 3. read the data from the resulting sliced table
> >
> > Considering how much work this has to do, it performs better than I
> > would have expected, but there's definitely a big fixed cost for each
> > batch of rows (constructing and destructing builders, making Tables
> > that can only be used once since they're immutable, etc). If the
> > batches weren't so small it would probably make sense, but as is it's
> > unworkable. I need to add rows to logical "tables" thousands of times
> > per second in aggregate.
> >
> > Am I just too far from Arrow's big data sweet spot, or is there
> > something I'm missing? I keep reading about IPC and streaming of Arrow
> > data, but I can't find a way to use it at such fine granularity. Thanks
> > in advance for any insights!
> >
> > Thanks!
> > Chris Osborn
> >
> >
> > * yes, I can just use a queue, but the promise of a uniform memory
> > layout that is simultaneously accessible to C++ and Python is very
> > compelling
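
To make the preallocation suggestion above concrete, here is a rough,
untested sketch for a single non-nullable int64 column. It is only an
illustration, not an existing Arrow API: kCapacity and RunExample are
invented names, and it assumes the Result-returning arrow::AllocateBuffer
from recent Arrow C++ releases.

#include <arrow/api.h>
#include <arrow/result.h>
#include <iostream>
#include <vector>

arrow::Status RunExample() {
  // Rows to preallocate per column (illustrative number).
  const int64_t kCapacity = 1024;

  // Preallocate the values buffer: 8 bytes per slot for int64, per the
  // columnar format's fixed-size primitive layout [1].
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Buffer> data,
                        arrow::AllocateBuffer(kCapacity * sizeof(int64_t)));
  auto* values = reinterpret_cast<int64_t*>(data->mutable_data());

  // Wrap the preallocated memory in ArrayData and a "parent" RecordBatch.
  // The field is non-nullable, so the validity buffer is left null.
  auto schema = arrow::schema(
      {arrow::field("x", arrow::int64(), /*nullable=*/false)});
  std::vector<std::shared_ptr<arrow::Buffer>> buffers = {nullptr, data};
  auto col = arrow::ArrayData::Make(arrow::int64(), kCapacity, buffers,
                                    /*null_count=*/0);
  std::vector<std::shared_ptr<arrow::ArrayData>> cols = {col};
  auto parent = arrow::RecordBatch::Make(schema, kCapacity, cols);

  // Producer: write rows directly into the region. Slots beyond `filled`
  // are still uninitialized, so consumers must only see filled slices.
  int64_t filled = 0;
  for (; filled < 8; ++filled) {
    values[filled] = filled * 100;
  }

  // Consumer: "chop off" the filled portion with Slice and read it as an
  // ordinary RecordBatch.
  std::shared_ptr<arrow::RecordBatch> ready = parent->Slice(0, filled);
  auto x = std::static_pointer_cast<arrow::Int64Array>(ready->column(0));
  std::cout << "rows=" << ready->num_rows()
            << " last=" << x->Value(ready->num_rows() - 1) << std::endl;
  return arrow::Status::OK();
}

int main() {
  arrow::Status st = RunExample();
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;
}

Extending this to nullable fields means preallocating (and initializing) the
validity bitmap as well, and variable-size types would need the separate
data-buffer handling described above.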