Very interesting. This is something that I would potentially also be
interested in, so if there were some code available out there, I could
potentially contribute to it, or at least use it. At the very least, I'd
love for something that allows Arrow to work with both large and very
small record batches (a few rows) in a seamless and efficient way to make
it into the Arrow codebase.

On Mon, Jun 29, 2020 at 5:05 PM, Wes McKinney <wesmck...@gmail.com>
wrote:

> On Fri, Jun 26, 2020 at 8:56 AM Chris Osborn <csosb...@fb.com.invalid>
> wrote:
> >
> > Yes, it would be quite feasible to preallocate a region large enough
> > for several thousand rows for each column, assuming I read from that
> > region while it's still filling in. When that region is full, I could
> > either allocate a new big chunk or loop around if I no longer need the
> > data. I'm now doing something like that in a revised prototype.
> > Specifically, I'm creating builders and calling Reserve() once up front
> > to get a large region, which I then fill in with multiple batches. As
> > the producer fills it in using ArrayBuilder::Append(), the consumers
> > read out earlier rows using ArrayBuilder::GetValue(). This works, but
> > I'm clearly going against the spirit of the library by using builders
> > as ersatz Arrays and a set of builders in lieu of a Table.
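> >
> > In rough outline, the pattern is something like this (an illustrative,
> > untested sketch rather than the actual prototype; the column type and
> > capacity are placeholders):
> >
> >     #include <arrow/api.h>
> >
> >     arrow::Status BuilderAsSharedBuffer() {
> >       arrow::Int64Builder builder;
> >       // One Reserve() up front to get a large contiguous region.
> >       ARROW_RETURN_NOT_OK(builder.Reserve(1 << 16));
> >       // Producer side: append values as records arrive.
> >       ARROW_RETURN_NOT_OK(builder.Append(42));
> >       // Consumer side: read earlier rows back out without ever
> >       // finishing the builder.
> >       int64_t row0 = builder.GetValue(0);
> >       (void)row0;
> >       return arrow::Status::OK();
> >     }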
> >
> > In short, it's feasible (and preferable) to preallocate the memory
> > needed, whether it's the builders' memory or the RecordBatch/Table's
> > memory (ideally that's the same thing?). I just haven't been able to
> > figure out how to do that gracefully.
>
> By following the columnar format's buffer layouts [1], it should be
> straightforward to compute the size of a memory region to preallocate
> that represents a RecordBatch's memory, then construct the Buffer and
> ArrayData objects that reference each constituent buffer, and finally
> create a RecordBatch from those ArrayData objects. Some assumptions
> must be made, of course (a sketch follows the list below):
>
> * If a field is nullable, then an empty validity bitmap must be
> preallocated (and you can initialize it to all valid or all null,
> based on what your application prefers)
> * You must decide what to do about variable-size allocations for
> binary/string types (and, by extension, for list types if you have
> Array/List-like data). So if you preallocate a region that can
> accommodate 1024 values, you might allocate 32KB data buffers for
> string data (or some multiple of that length if you have bigger
> strings). If you fill up the data buffer, you will have to move on
> to the next region. Another approach would be to let the string data
> buffer be a separate ResizableBuffer that you reallocate when you need
> to make it bigger
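>
> For the fixed-width case, a minimal sketch of this (untested; assumes a
> single non-nullable int64 column of 1024 rows, error handling abbreviated):
>
>     #include <arrow/api.h>
>
>     arrow::Result<std::shared_ptr<arrow::RecordBatch>> PreallocateBatch() {
>       const int64_t kRows = 1024;
>       auto schema = arrow::schema(
>           {arrow::field("x", arrow::int64(), /*nullable=*/false)});
>       // One fixed-width data buffer, per the columnar format's buffer
>       // listing [1]; no validity bitmap needed for a non-nullable field.
>       ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Buffer> data,
>                             arrow::AllocateBuffer(kRows * sizeof(int64_t)));
>       auto array_data = arrow::ArrayData::Make(
>           arrow::int64(), kRows, {nullptr, data}, /*null_count=*/0);
>       return arrow::RecordBatch::Make(schema, kRows, {array_data});
>     }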
>
> I could envision creating a C++ implementation to manage this whole
> process that becomes a part of the Arrow C++ codebase -- preallocate
> memory given some global / field-level options and then provide
> effectively "UnsafeAppend" APIs to write data into the preallocated
> region.
>
> If you create a "parent" RecordBatch that references the preallocated
> memory, then you can use `RecordBatch::Slice` to "chop off" the filled
> portion to pass to your consumer.
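>
> For example (illustrative names; `num_filled` is however many rows the
> producer has written so far):
>
>     // Zero-copy view of just the filled prefix of the parent batch.
>     std::shared_ptr<arrow::RecordBatch> view = parent->Slice(0, num_filled);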
>
> [1]:
> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#buffer-listing-for-each-layout
>
> > Thanks!
> > Chris Osborn
> >
> > ________________________________
> > From: Wes McKinney <wesmck...@gmail.com>
> > Sent: Thursday, June 25, 2020 10:13 PM
> > To: dev <dev@arrow.apache.org>
> > Subject: Re: Arrow for low-latency streaming of small batches?
> >
> > Is it feasible to preallocate the memory region where you are writing
> > the record batch?
> >
> > On Thu, Jun 25, 2020 at 1:06 PM Chris Osborn <csosb...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I am investigating Arrow for a project that needs to transfer records
> > > from a producer to one or more consumers in small batches (median
> > > batch size is 1) and with low latency. The usual structure for
> > > something like this would be a single producer multi-consumer queue*.
> > > Is there any sane way to use Arrow in this fashion? I have a little
> > > C++ prototype that works, but it does the following for each batch of
> > > rows:
> > >
> > > Producer side:
> > >     1. construct a set of builders
> > >     2. append a value to each builder for each record in the batch
> > >     3. finish the builders and use them to make a RecordBatch
> > >     4. append the RecordBatch to a vector
> > >
> > > Consumer side:
> > >     1. construct a Table from the vector of RecordBatches
> > >     2. slice out the part of the table that the consumer requires
> > >        (each consumer keeps its own offset)
> > >     3. read the data from the resulting sliced table
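> > >
> > > In simplified form, per batch (an untested sketch, not the actual
> > > prototype; a single int64 column is assumed):
> > >
> > >     #include <arrow/api.h>
> > >
> > >     // Producer: builder -> finished Array -> RecordBatch -> vector.
> > >     arrow::Status Produce(
> > >         std::vector<std::shared_ptr<arrow::RecordBatch>>* batches) {
> > >       arrow::Int64Builder b;
> > >       ARROW_RETURN_NOT_OK(b.Append(1));  // one Append per record
> > >       std::shared_ptr<arrow::Array> a;
> > >       ARROW_RETURN_NOT_OK(b.Finish(&a));
> > >       auto schema = arrow::schema({arrow::field("x", arrow::int64())});
> > >       batches->push_back(
> > >           arrow::RecordBatch::Make(schema, a->length(), {a}));
> > >       return arrow::Status::OK();
> > >     }
> > >
> > >     // Consumer: one Table over all batches, sliced at this
> > >     // consumer's own offset.
> > >     arrow::Status Consume(
> > >         const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches,
> > >         int64_t offset) {
> > >       ARROW_ASSIGN_OR_RAISE(auto table,
> > >                             arrow::Table::FromRecordBatches(batches));
> > >       std::shared_ptr<arrow::Table> view = table->Slice(offset);
> > >       // ... read the data from `view` ...
> > >       return arrow::Status::OK();
> > >     }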
> > >
> > > Considering how much work this has to do, it performs better than I
> > > would have expected, but there's definitely a big fixed cost for each
> > > batch of rows (constructing and destructing builders, making Tables
> > > that can only be used once since they're immutable, etc). If the
> > > batches weren't so small it would probably make sense, but as is it's
> > > unworkable. I need to add rows to logical "tables" thousands of times
> > > per second in aggregate.
> > >
> > > Am I just too far from Arrow's big data sweet spot, or is there
> > > something I'm missing? I keep reading about IPC and streaming of
> > > Arrow data, but I can't find a way to use it at such fine
> > > granularity. Thanks in advance for any insights!
> > >
> > > Thanks!
> > > Chris Osborn
> > >
> > >
> > > * yes, I can just use a queue, but the promise of a uniform memory
> > > layout that is simultaneously accessible to C++ and Python is very
> > > compelling
>


-- 

│ Christian Hudon
│ Applied Research Scientist
   Element AI, 6650 Saint-Urbain #500
   Montréal, QC, H2S 3G9, Canada
   Elementai.com
