Yes, it would be quite feasible to preallocate a region large enough for 
several thousand rows per column, provided I can read from that region while 
it's still being filled. When the region is full, I could either allocate a 
new large chunk or wrap around if I no longer need the older data. I'm now 
doing something like that in a revised prototype: I create the builders and 
call Reserve() once up front to get a large region, which I then fill in 
over multiple batches. As the producer fills it in using 
ArrayBuilder::Append(), the consumers read out earlier rows using the typed 
builders' GetValue(). This works, but I'm clearly going against the spirit 
of the library by using builders as ersatz Arrays and a set of builders in 
lieu of a Table.
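
For concreteness, a minimal sketch of that pattern with a single int64 
column (the capacity and values are arbitrary, synchronization between the 
producer and consumers is elided, and GetValue() actually lives on the 
typed NumericBuilder rather than on ArrayBuilder itself):

    #include <arrow/api.h>

    arrow::Status BuilderAsErsatzArray() {
      // One long-lived builder per column, reserved once up front.
      arrow::Int64Builder builder;
      ARROW_RETURN_NOT_OK(builder.Reserve(100000));

      // Producer side: append rows as they arrive. Finish() is never
      // called, so no Array is ever materialized.
      ARROW_RETURN_NOT_OK(builder.Append(42));

      // Consumer side: each consumer keeps its own offset and reads rows
      // the producer has already written; builder.length() is the
      // high-water mark.
      int64_t offset = 0;
      while (offset < builder.length()) {
        int64_t value = builder.GetValue(offset++);
        (void)value;  // ... process the row ...
      }
      return arrow::Status::OK();
    }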

In short, it's feasible (and preferable) to preallocate the memory needed, 
whether that's the builders' memory or the RecordBatch/Table's memory 
(ideally those would be one and the same?). I just haven't been able to 
figure out how to do that gracefully.
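
For concreteness, the sort of thing I mean by preallocating the 
RecordBatch's memory directly: allocate the buffers myself, let the 
producer write into them, and wrap them afterwards. A sketch (I'm not at 
all sure this is sanctioned usage, especially mutating the buffer after an 
Array has been wrapped around it):

    arrow::Status PreallocatedColumn() {
      // Allocate a values buffer big enough for kCapacity int64 rows.
      constexpr int64_t kCapacity = 100000;
      ARROW_ASSIGN_OR_RAISE(
          auto values, arrow::AllocateBuffer(kCapacity * sizeof(int64_t)));
      auto* raw = reinterpret_cast<int64_t*>(values->mutable_data());

      // Producer writes rows in place as they arrive.
      int64_t rows = 0;
      raw[rows++] = 42;

      // Expose the first `rows` values as an Array without copying
      // (buffer slot 0 is the null bitmap; nullptr means no nulls).
      auto data = arrow::ArrayData::Make(arrow::int64(), rows,
                                         {nullptr, std::move(values)},
                                         /*null_count=*/0);
      std::shared_ptr<arrow::Array> column = arrow::MakeArray(data);

      auto schema = arrow::schema({arrow::field("x", arrow::int64())});
      auto batch = arrow::RecordBatch::Make(schema, rows, {column});
      (void)batch;  // ... hand the batch to consumers ...
      return arrow::Status::OK();
    }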

Thanks!
Chris Osborn

________________________________
From: Wes McKinney <wesmck...@gmail.com>
Sent: Thursday, June 25, 2020 10:13 PM
To: dev <dev@arrow.apache.org>
Subject: Re: Arrow for low-latency streaming of small batches?

Is it feasible to preallocate the memory region where you are writing
the record batch?

On Thu, Jun 25, 2020 at 1:06 PM Chris Osborn <csosb...@gmail.com> wrote:
>
> Hi,
>
> I am investigating Arrow for a project that needs to transfer records from
> a producer to one or more consumers in small batches (median batch size is
> 1) and with low latency. The usual structure for something like this would
> be a single-producer multi-consumer queue*. Is there any sane way to use
> Arrow in this fashion? I have a little C++ prototype that works, but it
> does the following for each batch of rows:
>
> Producer side:
>     1. construct a set of builders
>     2. append a value to each builder for each record in the batch
>     3. finish the builders and use them to make a RecordBatch
>     4. append the RecordBatch to a vector (sketch below)
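>
> A minimal sketch of those four steps for a single int64 column (the
> schema and batches variables are just for illustration):
>
>     arrow::Status ProduceBatch(
>         const std::shared_ptr<arrow::Schema>& schema,
>         std::vector<std::shared_ptr<arrow::RecordBatch>>* batches) {
>       arrow::Int64Builder builder;                   // 1. construct builders
>       ARROW_RETURN_NOT_OK(builder.Append(42));       // 2. append each value
>       std::shared_ptr<arrow::Array> column;
>       ARROW_RETURN_NOT_OK(builder.Finish(&column));  // 3. finish the builders
>       batches->push_back(arrow::RecordBatch::Make(   // 4. append the batch
>           schema, column->length(), {column}));
>       return arrow::Status::OK();
>     }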
>
> Consumer side:
>     1. construct a Table from the vector of RecordBatches
>     2. slice out the part of the table that the consumer requires (each
> consumer keeps its own offset)
>     3. read the data from the resulting sliced table (sketch below)
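>
> A sketch of these steps (again one int64 column; offset/count are the
> consumer's own bookkeeping):
>
>     arrow::Status Consume(
>         const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches,
>         int64_t offset, int64_t count) {
>       // 1. construct a Table from the vector of RecordBatches
>       ARROW_ASSIGN_OR_RAISE(auto table,
>                             arrow::Table::FromRecordBatches(batches));
>       // 2. slice out the part this consumer needs
>       auto slice = table->Slice(offset, count);
>       // 3. read the values back out, chunk by chunk
>       for (const auto& chunk : slice->column(0)->chunks()) {
>         auto ints = std::static_pointer_cast<arrow::Int64Array>(chunk);
>         for (int64_t i = 0; i < ints->length(); ++i) {
>           (void)ints->Value(i);  // ... deliver the row ...
>         }
>       }
>       return arrow::Status::OK();
>     }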
>
> Considering how much work this has to do, it performs better than I would
> have expected, but there's definitely a big fixed cost for each batch of
> rows (constructing and destroying builders, making Tables that can only be
> used once since they're immutable, etc.). If the batches weren't so small,
> it would probably make sense, but as it stands it's unworkable. I need to
> add rows to logical "tables" thousands of times per second in aggregate.
>
> Am I just too far from Arrow's big data sweet spot, or is there something
> I'm missing? I keep reading about IPC and streaming of Arrow data, but I
> can't find a way to use it at such fine granularity. Thanks in advance for
> any insights!
>
> Thanks!
> Chris Osborn
>
>
> * yes, I can just use a queue, but the promise of a uniform memory layout
> that is simultaneously accessible to C++ and Python is very compelling
