Is it feasible to preallocate the memory region where you are writing the record batch?
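Something like this, as a minimal sketch (assuming int64 columns and a row count known before the appends; Reserve() sizes the builder's buffer once and UnsafeAppend() skips the capacity check, so the per-row work is just a store; the schema and column name are illustrative):

    #include <arrow/api.h>

    // Build one batch with the builder's buffer preallocated up front.
    arrow::Result<std::shared_ptr<arrow::RecordBatch>> MakeBatch(
        const std::vector<int64_t>& values) {
      arrow::Int64Builder builder;
      ARROW_RETURN_NOT_OK(
          builder.Reserve(static_cast<int64_t>(values.size())));  // one allocation
      for (int64_t v : values) {
        builder.UnsafeAppend(v);  // no capacity check, no reallocation
      }
      std::shared_ptr<arrow::Array> array;
      ARROW_RETURN_NOT_OK(builder.Finish(&array));
      auto schema = arrow::schema({arrow::field("x", arrow::int64())});
      return arrow::RecordBatch::Make(schema, array->length(), {array});
    }

That still pays the builder construction and Finish() per batch, but it avoids reallocating during the appends.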
On Thu, Jun 25, 2020 at 1:06 PM Chris Osborn <csosb...@gmail.com> wrote:
>
> Hi,
>
> I am investigating Arrow for a project that needs to transfer records
> from a producer to one or more consumers in small batches (median batch
> size is 1) and with low latency. The usual structure for something like
> this would be a single-producer multi-consumer queue*. Is there any sane
> way to use Arrow in this fashion? I have a little C++ prototype that
> works, but it does the following for each batch of rows:
>
> Producer side:
> 1. construct a set of builders
> 2. append a value to each builder for each record in the batch
> 3. finish the builders and use them to make a RecordBatch
> 4. append the RecordBatch to a vector
>
> Consumer side:
> 1. construct a Table from the vector of RecordBatches
> 2. slice out the part of the table that the consumer requires (each
>    consumer keeps its own offset)
> 3. read the data from the resulting sliced table
>
> Considering how much work this has to do, it performs better than I
> would have expected, but there's definitely a big fixed cost for each
> batch of rows (constructing and destructing builders, making Tables
> that can only be used once since they're immutable, etc.). If the
> batches weren't so small it would probably make sense, but as is it's
> unworkable. I need to add rows to logical "tables" thousands of times
> per second in aggregate.
>
> Am I just too far from Arrow's big data sweet spot, or is there
> something I'm missing? I keep reading about IPC and streaming of Arrow
> data, but I can't find a way to use it at such fine granularity. Thanks
> in advance for any insights!
>
> Thanks!
> Chris Osborn
>
>
> * yes, I can just use a queue, but the promise of a uniform memory
> layout that is simultaneously accessible to C++ and Python is very
> compelling
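For anyone reading along, the consumer side described above boils down to something like this (a sketch; it assumes all batches share one schema, and the function and parameter names are made up):

    #include <arrow/api.h>

    // Stitch the accumulated batches into a Table (this wraps the existing
    // arrays as chunked columns) and slice out one consumer's window.
    // Slice() returns a zero-copy view, so no row data is copied.
    arrow::Result<std::shared_ptr<arrow::Table>> ReadWindow(
        const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches,
        int64_t consumer_offset, int64_t num_rows) {
      ARROW_ASSIGN_OR_RAISE(auto table,
                            arrow::Table::FromRecordBatches(batches));
      return table->Slice(consumer_offset, num_rows);
    }

The per-read cost is therefore mostly the Table and ChunkedArray bookkeeping in FromRecordBatches(), which matches the fixed per-batch overhead described above.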