Yes, it would be quite feasible to preallocate a region large enough for several thousand rows per column, provided I can read from that region while it's still filling in. When the region is full, I could either allocate a new large chunk or wrap around if the older data is no longer needed. I'm now doing something like that in a revised prototype: I create the builders and call Reserve() once up front to get a large region, which I then fill in over multiple batches. As the producer fills it in using ArrayBuilder::Append(), the consumers read out earlier rows using ArrayBuilder::GetValue(). This works, but I'm clearly going against the spirit of the library by using builders as ersatz Arrays and a set of builders in lieu of a Table.
In short, it's feasible (and preferable) to preallocate the memory needed, whether that's the builders' memory or the RecordBatch/Table's memory (ideally those are the same thing?). I just haven't been able to figure out how to do that gracefully.

Thanks!
Chris Osborn

________________________________
From: Wes McKinney <wesmck...@gmail.com>
Sent: Thursday, June 25, 2020 10:13 PM
To: dev <dev@arrow.apache.org>
Subject: Re: Arrow for low-latency streaming of small batches?

Is it feasible to preallocate the memory region where you are writing
the record batch?

On Thu, Jun 25, 2020 at 1:06 PM Chris Osborn <csosb...@gmail.com> wrote:
>
> Hi,
>
> I am investigating Arrow for a project that needs to transfer records from
> a producer to one or more consumers in small batches (median batch size is
> 1) and with low latency. The usual structure for something like this would
> be a single-producer, multi-consumer queue*. Is there any sane way to use
> Arrow in this fashion? I have a little C++ prototype that works, but it
> does the following for each batch of rows:
>
> Producer side:
> 1. construct a set of builders
> 2. append a value to each builder for each record in the batch
> 3. finish the builders and use them to make a RecordBatch
> 4. append the RecordBatch to a vector
>
> Consumer side:
> 1. construct a Table from the vector of RecordBatches
> 2. slice out the part of the table that the consumer requires (each
> consumer keeps its own offset)
> 3. read the data from the resulting sliced table
>
> Considering how much work this has to do, it performs better than I would
> have expected, but there's definitely a big fixed cost for each batch of
> rows (constructing and destructing builders, making Tables that can only be
> used once since they're immutable, etc.). If the batches weren't so small it
> would probably make sense, but as is it's unworkable. I need to add rows to
> logical "tables" thousands of times per second in aggregate.
>
> Am I just too far from Arrow's big-data sweet spot, or is there something
> I'm missing? I keep reading about IPC and streaming of Arrow data, but I
> can't find a way to use it at such fine granularity. Thanks in advance for
> any insights!
>
> Thanks!
> Chris Osborn
>
>
> * yes, I can just use a queue, but the promise of a uniform memory layout
> that is simultaneously accessible to C++ and Python is very compelling