Is it feasible to preallocate the memory region where you are writing
the record batch?
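
For example, something along these lines (an untested sketch, assuming a
single int64 column; Reserve ensures capacity in the builder's buffers up
front so the per-batch appends don't allocate, and Finish resets the
builder so it can be reused for the next batch instead of being
reconstructed):

    #include <arrow/api.h>
    #include <vector>

    // Untested sketch: keep one long-lived builder per column and reserve
    // capacity before appending, so each small batch avoids allocation.
    arrow::Status ProduceBatch(
        arrow::Int64Builder* builder,                // reused across batches
        const std::shared_ptr<arrow::Schema>& schema,
        const std::vector<int64_t>& values,
        std::shared_ptr<arrow::RecordBatch>* out) {
      ARROW_RETURN_NOT_OK(
          builder->Reserve(static_cast<int64_t>(values.size())));
      ARROW_RETURN_NOT_OK(builder->AppendValues(values));
      std::shared_ptr<arrow::Array> array;
      ARROW_RETURN_NOT_OK(builder->Finish(&array));  // resets the builder
      *out = arrow::RecordBatch::Make(schema, array->length(), {array});
      return arrow::Status::OK();
    }

That doesn't make the finished batches mutable, but it should cut the
per-batch construct/destruct cost you describe.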

On Thu, Jun 25, 2020 at 1:06 PM Chris Osborn <csosb...@gmail.com> wrote:
>
> Hi,
>
> I am investigating Arrow for a project that needs to transfer records from
> a producer to one or more consumers in small batches (median batch size is
> 1) and with low latency. The usual structure for something like this would
> be a single-producer, multi-consumer queue*. Is there any sane way to use
> Arrow in this fashion? I have a little C++ prototype that works, but it
> does the following for each batch of rows (sketched in code below):
>
> Producer side:
>     1. construct a set of builders
>     2. append a value to each builder for each record in the batch
>     3. finish the builders and use them to make a RecordBatch
>     4. append the RecordBatch to a vector
>
> Consumer side:
>     1. construct a Table from the vector of RecordBatches
>     2. slice out the part of the table that the consumer requires (each
> consumer keeps its own offset)
>     3. read the data from the resulting sliced table
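>
> (Concretely, roughly like the following; this is an untested sketch with
> a single int64 column, and note that recent Arrow versions return Result
> from Table::FromRecordBatches while older ones use an out-parameter:)
>
>     #include <arrow/api.h>
>     #include <vector>
>
>     arrow::Status RoundTrip(
>         const std::vector<int64_t>& incoming,
>         std::vector<std::shared_ptr<arrow::RecordBatch>>* batches,
>         int64_t consumer_offset) {
>       auto schema = arrow::schema({arrow::field("value", arrow::int64())});
>
>       // Producer: construct a builder, append the batch's values, finish
>       // into a RecordBatch, and append that to the shared vector.
>       arrow::Int64Builder builder;
>       ARROW_RETURN_NOT_OK(builder.AppendValues(incoming));
>       std::shared_ptr<arrow::Array> array;
>       ARROW_RETURN_NOT_OK(builder.Finish(&array));
>       batches->push_back(
>           arrow::RecordBatch::Make(schema, array->length(), {array}));
>
>       // Consumer: stitch all batches into a Table, slice off everything
>       // past this consumer's own offset, and read from the slice.
>       ARROW_ASSIGN_OR_RAISE(auto table,
>                             arrow::Table::FromRecordBatches(*batches));
>       std::shared_ptr<arrow::Table> unread = table->Slice(consumer_offset);
>       // ... read column chunks from `unread` ...
>       return arrow::Status::OK();
>     }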
>
> Considering how much work this has to do, it performs better than I would
> have expected, but there's definitely a big fixed cost for each batch of
> rows (constructing and destructing builders, making Tables that can only
> be used once since they're immutable, etc.). If the batches weren't so
> small it would probably make sense, but as is, it's unworkable. I need to
> add rows to logical "tables" thousands of times per second in aggregate.
>
> Am I just too far from Arrow's big data sweet spot, or is there something
> I'm missing? I keep reading about IPC and streaming of Arrow data, but I
> can't find a way to use it at such fine granularity. Thanks in advance for
> any insights!
>
> Thanks!
> Chris Osborn
>
>
> * yes, I can just use a queue, but the promise of a uniform memory layout
> that is simultaneously accessible to C++ and Python is very compelling
