Hi,

I am investigating Arrow for a project that needs to transfer records from a producer to one or more consumers in small batches (median batch size is 1) and with low latency. The usual structure for something like this would be a single-producer, multi-consumer queue*. Is there any sane way to use Arrow in this fashion? I have a little C++ prototype that works, but it does the following for each batch of rows:
Producer side:
1. construct a set of builders
2. append a value to each builder for each record in the batch
3. finish the builders and use them to make a RecordBatch
4. append the RecordBatch to a vector

Consumer side:
1. construct a Table from the vector of RecordBatches
2. slice out the part of the table that the consumer requires (each consumer keeps its own offset)
3. read the data from the resulting sliced table

(I've pasted a stripped-down sketch of this per-batch path at the end of this message.)

Considering how much work this has to do, it performs better than I would have expected, but there's definitely a big fixed cost for each batch of rows (constructing and destructing builders, making Tables that can only be used once since they're immutable, etc.). If the batches weren't so small it would probably make sense, but as it is it's unworkable: I need to add rows to logical "tables" thousands of times per second in aggregate. Am I just too far from Arrow's big-data sweet spot, or is there something I'm missing? I keep reading about IPC and streaming of Arrow data, but I can't find a way to use it at such fine granularity.

Thanks in advance for any insights!

Thanks!
Chris Osborn

* Yes, I can just use a queue, but the promise of a uniform memory layout that is simultaneously accessible from C++ and Python is very compelling.
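
P.S. For concreteness, here is a stripped-down sketch of the per-batch path described above. The two-column schema (an int64 "id" and a utf8 "name") and the helper names (MakeBatch, ReadFrom) are illustrative placeholders, not my actual prototype:

    #include <arrow/api.h>

    #include <memory>
    #include <string>
    #include <vector>

    // Producer side: one RecordBatch per (tiny) batch of rows.
    arrow::Result<std::shared_ptr<arrow::RecordBatch>> MakeBatch(
        const std::vector<int64_t>& ids,
        const std::vector<std::string>& names) {
      // 1. construct a set of builders
      arrow::Int64Builder id_builder;
      arrow::StringBuilder name_builder;

      // 2. append a value to each builder for each record in the batch
      ARROW_RETURN_NOT_OK(id_builder.AppendValues(ids));
      for (const auto& name : names) {
        ARROW_RETURN_NOT_OK(name_builder.Append(name));
      }

      // 3. finish the builders and use them to make a RecordBatch
      std::shared_ptr<arrow::Array> id_array;
      std::shared_ptr<arrow::Array> name_array;
      ARROW_RETURN_NOT_OK(id_builder.Finish(&id_array));
      ARROW_RETURN_NOT_OK(name_builder.Finish(&name_array));

      auto schema = arrow::schema({arrow::field("id", arrow::int64()),
                                   arrow::field("name", arrow::utf8())});
      // 4. the caller appends the resulting RecordBatch to a shared vector
      return arrow::RecordBatch::Make(schema, static_cast<int64_t>(ids.size()),
                                      {id_array, name_array});
    }

    // Consumer side: rebuild a Table from the accumulated batches and slice
    // from this consumer's own offset onward.
    arrow::Result<std::shared_ptr<arrow::Table>> ReadFrom(
        const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches,
        int64_t consumer_offset) {
      // 1. construct a Table from the vector of RecordBatches
      ARROW_ASSIGN_OR_RAISE(auto table, arrow::Table::FromRecordBatches(batches));
      // 2. slice out the part of the table this consumer hasn't read yet;
      // 3. the caller then reads values out of the sliced table
      return table->Slice(consumer_offset);
    }

The Slice itself is zero-copy; essentially all of the per-batch overhead I'm describing is the builder setup/finish on the producer side and the Table reconstruction on every read.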