Very interesting. This is something I would potentially be interested in as well, so if some code were available out there, I could contribute to it, or at least use it. At a minimum, I'd love for something that allows Arrow to work with both large and very small record batches (a few rows) in a seamless and efficient way to make it into the Arrow codebase.
On Mon, Jun 29, 2020 at 5:05 PM Wes McKinney <wesmck...@gmail.com> wrote:

> On Fri, Jun 26, 2020 at 8:56 AM Chris Osborn <csosb...@fb.com.invalid> wrote:
> >
> > Yes, it would be quite feasible to preallocate a region large enough for several thousand rows for each column, assuming I read from that region while it's still filling in. When that region is full, I could either allocate a new big chunk or loop around if I no longer need the data. I'm now doing something like that in a revised prototype. Specifically I'm creating builders and calling Reserve() once up front to get a large region, which I then fill in with multiple batches. As the producer fills it in using ArrayBuilder::Append(), the consumers read out earlier rows using ArrayBuilder::GetValue(). This works, but I'm clearly going against the spirit of the library by using builders as ersatz Arrays and a set of builders in lieu of a Table.
> >
> > In short, it's feasible (and preferable) to preallocate the memory needed, whether it's the builders' memory or the RecordBatch/Table's memory (ideally that's the same thing?). I just haven't been able to figure out how to do that gracefully.
>
> By following the columnar format's buffer layouts [1] it should be straightforward to compute the size of a memory region to preallocate that represents a RecordBatch's memory, then construct the Buffer and ArrayData objects that reference each constituent buffer, and then create a RecordBatch from those ArrayData objects. Some assumptions must be made of course:
>
> * If a field is nullable, then an empty validity bitmap must be preallocated (and you can initialize it to all valid or all null based on what your application prefers)
> * Must decide what to do about variable-size allocations for binary/string types (and, extrapolating, analogously for list types if you have Array/List-like data). So if you preallocated a region that can accommodate 1024 values then you might allocate 32KB data buffers for string data (or some factor of the length if you have bigger strings). If you fill up the data buffer then you will have to move on to the next region. Another approach might be to let the string data buffer be a separate ResizableBuffer that you reallocate when you need to make it bigger
>
> I could envision creating a C++ implementation to manage this whole process that becomes a part of the Arrow C++ codebase -- preallocate memory given some global / field-level options and then provide effectively "UnsafeAppend" APIs to write data into the preallocated region.
>
> If you create a "parent" RecordBatch that references the preallocated memory then you can use `RecordBatch::Slice` to "chop off" the filled portion to pass to your consumer.
>
> [1]: https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#buffer-listing-for-each-layout
>
> > Thanks!
> > Chris Osborn
> >
> > ________________________________
> > From: Wes McKinney <wesmck...@gmail.com>
> > Sent: Thursday, June 25, 2020 10:13 PM
> > To: dev <dev@arrow.apache.org>
> > Subject: Re: Arrow for low-latency streaming of small batches?
> >
> > Is it feasible to preallocate the memory region where you are writing the record batch?
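For what it's worth, my (possibly naive) reading of the suggestion above, for a single non-nullable int64 column, would be something along these lines. MakePreallocatedBatch is just an illustrative name, this assumes a recent Arrow C++ where AllocateBuffer returns a Result, and error handling is elided with ValueOrDie():

    #include <cstdint>
    #include <memory>

    #include <arrow/api.h>

    // Illustrative sketch: preallocate room for `capacity` int64 values and
    // wrap the region in a "parent" RecordBatch up front. buffers[0] (the
    // validity bitmap) can stay nullptr because the field is non-nullable.
    std::shared_ptr<arrow::RecordBatch> MakePreallocatedBatch(
        int64_t capacity, std::shared_ptr<arrow::Buffer>* out_values) {
      auto schema = arrow::schema(
          {arrow::field("value", arrow::int64(), /*nullable=*/false)});
      std::shared_ptr<arrow::Buffer> values =
          arrow::AllocateBuffer(capacity * sizeof(int64_t)).ValueOrDie();
      auto data = arrow::ArrayData::Make(arrow::int64(), capacity,
                                         {nullptr, values}, /*null_count=*/0);
      *out_values = values;
      return arrow::RecordBatch::Make(schema, capacity, {data});
    }

    int main() {
      std::shared_ptr<arrow::Buffer> values;
      auto parent = MakePreallocatedBatch(/*capacity=*/4096, &values);

      // Producer: write straight into the preallocated region.
      auto* raw = reinterpret_cast<int64_t*>(values->mutable_data());
      int64_t rows_written = 0;
      raw[rows_written++] = 42;

      // Consumer: zero-copy view of the filled prefix, via RecordBatch::Slice.
      std::shared_ptr<arrow::RecordBatch> view = parent->Slice(0, rows_written);
      return view->num_rows() == 1 ? 0 : 1;
    }

If that is roughly the intended shape, then the C++ helper described above would mostly be about computing those buffer sizes per field (validity bitmaps, variable-size string buffers, etc.) and exposing UnsafeAppend-style writes into the region.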
> > On Thu, Jun 25, 2020 at 1:06 PM Chris Osborn <csosb...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I am investigating Arrow for a project that needs to transfer records from a producer to one or more consumers in small batches (median batch size is 1) and with low latency. The usual structure for something like this would be a single producer multi-consumer queue*. Is there any sane way to use Arrow in this fashion? I have a little C++ prototype that works, but it does the following for each batch of rows:
> > >
> > > Producer side:
> > > 1. construct a set of builders
> > > 2. append a value to each builder for each record in the batch
> > > 3. finish the builders and use them to make a RecordBatch
> > > 4. append the RecordBatch to a vector
> > >
> > > Consumer side:
> > > 1. construct a Table from the vector of RecordBatches
> > > 2. slice out the part of the table that the consumer requires (each consumer keeps its own offset)
> > > 3. read the data from the resulting sliced table
> > >
> > > Considering how much work this has to do it performs better than I would have expected, but there's definitely a big fixed cost for each batch of rows (constructing and destructing builders, making Tables that can only be used once since they're immutable, etc). If the batches weren't so small it would probably make sense, but as is it's unworkable. I need to add rows to logical "tables" thousands of times per second in aggregate.
> > >
> > > Am I just too far from Arrow's big data sweet spot, or is there something I'm missing? I keep reading about IPC and streaming of Arrow data, but I can't find a way to use it at such fine granularity. Thanks in advance for any insights!
> > >
> > > Thanks!
> > > Chris Osborn
> > >
> > > * yes, I can just use a queue, but the promise of a uniform memory layout that is simultaneously accessible to C++ and Python is very compelling

--
Christian Hudon
Applied Research Scientist
Element AI
6650 Saint-Urbain #500
Montréal, QC, H2S 3G9, Canada
Elementai.com
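P.S. For concreteness, the per-batch flow described in the original message above -- fresh builders, a new immutable RecordBatch, then a Table rebuilt and sliced per consumer -- would look roughly like the following for a single int64 column. The names (AppendBatch, Consume) and the minimal error handling are illustrative only; `schema` is assumed to hold a single int64 field.

    #include <cstdint>
    #include <memory>
    #include <vector>

    #include <arrow/api.h>

    // Producer side: a fresh builder and a new immutable RecordBatch per
    // batch, even when the batch holds a single row.
    std::shared_ptr<arrow::RecordBatch> AppendBatch(
        const std::shared_ptr<arrow::Schema>& schema,
        const std::vector<int64_t>& rows,
        std::vector<std::shared_ptr<arrow::RecordBatch>>* batches) {
      arrow::Int64Builder builder;
      for (int64_t v : rows) {
        if (!builder.Append(v).ok()) return nullptr;
      }
      std::shared_ptr<arrow::Array> column;
      if (!builder.Finish(&column).ok()) return nullptr;
      auto batch = arrow::RecordBatch::Make(schema, column->length(), {column});
      batches->push_back(batch);
      return batch;
    }

    // Consumer side: rebuild a Table, slice off the rows this consumer has
    // not seen yet, read them, and return the consumer's new offset.
    int64_t Consume(
        const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches,
        int64_t offset) {
      auto table = arrow::Table::FromRecordBatches(batches).ValueOrDie();
      auto unread = table->Slice(offset);
      for (const auto& chunk : unread->column(0)->chunks()) {
        const auto& values = static_cast<const arrow::Int64Array&>(*chunk);
        for (int64_t i = 0; i < values.length(); ++i) {
          // ... process values.Value(i) here ...
        }
      }
      return table->num_rows();
    }

The per-batch cost mentioned above is visible here: every call to AppendBatch constructs and destroys a builder plus a RecordBatch, and every call to Consume rebuilds a Table, regardless of how few rows are involved.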