Hi,

I am investigating Arrow for a project that needs to transfer records from
a producer to one or more consumers in small batches (median batch size is
1) and with low latency. The usual structure for something like this would
be a single-producer, multi-consumer queue*. Is there any sane way to use
Arrow in this fashion? I have a little C++ prototype that works, but it
does the following for each batch of rows (rough sketches below):

Producer side:
    1. construct a set of builders
    2. append a value to each builder for each record in the batch
    3. finish the builders and use them to make a RecordBatch
    4. append the RecordBatch to a vector
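
In code, the producer path looks roughly like this (a trimmed sketch of my
prototype; the two-column int64/double schema and the function name are
just placeholders):

    #include <arrow/api.h>

    #include <memory>
    #include <vector>

    // Sketch of the per-batch producer path; the int64/double columns
    // stand in for my real schema.
    arrow::Status ProduceBatch(
        const std::shared_ptr<arrow::Schema>& schema,
        const std::vector<int64_t>& ids,
        const std::vector<double>& values,
        std::vector<std::shared_ptr<arrow::RecordBatch>>* shared) {
      // 1. construct a set of builders, one per column
      arrow::Int64Builder id_builder;
      arrow::DoubleBuilder value_builder;

      // 2. append a value to each builder for each record in the batch
      ARROW_RETURN_NOT_OK(id_builder.AppendValues(ids));
      ARROW_RETURN_NOT_OK(value_builder.AppendValues(values));

      // 3. finish the builders and make a RecordBatch from the arrays
      std::shared_ptr<arrow::Array> id_array;
      std::shared_ptr<arrow::Array> value_array;
      ARROW_RETURN_NOT_OK(id_builder.Finish(&id_array));
      ARROW_RETURN_NOT_OK(value_builder.Finish(&value_array));
      auto batch = arrow::RecordBatch::Make(
          schema, static_cast<int64_t>(ids.size()),
          {id_array, value_array});

      // 4. append the RecordBatch to the vector shared with consumers
      shared->push_back(std::move(batch));
      return arrow::Status::OK();
    }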

Consumer side:
    1. construct a Table from the vector of RecordBatches
    2. slice out the part of the table that the consumer requires (each
consumer keeps its own offset)
    3. read the data from the resulting sliced table
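
And the consumer path, again as a rough sketch (the column types match the
placeholder schema above, and "offset" is each consumer's own bookkeeping
rather than anything Arrow provides):

    #include <arrow/api.h>

    #include <memory>
    #include <vector>

    // Sketch of the per-consumer read path.
    arrow::Status ConsumeNewRows(
        const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches,
        int64_t* offset) {
      // 1. construct a Table from the vector of RecordBatches
      ARROW_ASSIGN_OR_RAISE(auto table,
                            arrow::Table::FromRecordBatches(batches));

      // 2. slice out the rows this consumer has not yet seen
      auto unread = table->Slice(*offset);

      // 3. read the data from the resulting sliced table
      auto ids = unread->column(0);  // a ChunkedArray
      for (const auto& chunk : ids->chunks()) {
        auto typed = std::static_pointer_cast<arrow::Int64Array>(chunk);
        for (int64_t i = 0; i < typed->length(); ++i) {
          // ... hand typed->Value(i) to the consumer ...
        }
      }

      *offset += unread->num_rows();
      return arrow::Status::OK();
    }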

Considering how much work this has to do, it performs better than I would
have expected, but there's definitely a big fixed cost for each batch of
rows (constructing and destructing builders, making Tables that can only
be used once since they're immutable, etc.). If the batches weren't so
small it would probably make sense, but as it stands it's unworkable. I
need to add rows to logical "tables" thousands of times per second in
aggregate.

Am I just too far from Arrow's big data sweet spot, or is there something
I'm missing? I keep reading about IPC and streaming of Arrow data, but I
can't find a way to use it at such fine granularity.
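
For what it's worth, the closest I've pieced together from the docs is
something like the sketch below, which writes each batch as its own IPC
message (the in-memory sink is just a stand-in for whatever transport I'd
actually use, and I haven't tested this for my scenario):

    #include <arrow/api.h>
    #include <arrow/io/memory.h>
    #include <arrow/ipc/writer.h>

    #include <memory>
    #include <vector>

    // Untested sketch: stream tiny RecordBatches using the Arrow IPC
    // stream format.
    arrow::Status StreamBatches(
        const std::shared_ptr<arrow::Schema>& schema,
        const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches) {
      ARROW_ASSIGN_OR_RAISE(auto sink,
                            arrow::io::BufferOutputStream::Create());
      ARROW_ASSIGN_OR_RAISE(auto writer,
                            arrow::ipc::MakeStreamWriter(sink, schema));
      for (const auto& batch : batches) {
        // One IPC message per (often single-row) batch -- this is the
        // per-batch overhead I'm worried about.
        ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
      }
      return writer->Close();
    }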

Thanks!
Chris Osborn


* yes, I can just use a queue, but the promise of a uniform memory layout
that is simultaneously accessible to C++ and Python is very compelling
