This might help with getting the size of the output buffer upfront:
https://github.com/apache/arrow/blob/1830d1558be8741e7412f6af30582ff457f0f34f/cpp/src/arrow/io/memory.h#L96
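
For instance, something along these lines (a rough, untested sketch
assuming that link points at MockOutputStream; the SerializeTable
helper is just my own naming):

#include <memory>

#include "arrow/api.h"
#include "arrow/io/memory.h"
#include "arrow/ipc/api.h"

// Two-pass serialization: first count the exact IPC stream size with
// MockOutputStream (which only tracks bytes written, no allocation),
// then write into a single preallocated buffer of exactly that size.
// The table is serialized twice, trading CPU for zero reallocations.
arrow::Result<std::shared_ptr<arrow::Buffer>> SerializeTable(
    const std::shared_ptr<arrow::Table>& table) {
  auto mock = std::make_shared<arrow::io::MockOutputStream>();
  ARROW_ASSIGN_OR_RAISE(auto counter,
                        arrow::ipc::MakeStreamWriter(mock, table->schema()));
  ARROW_RETURN_NOT_OK(counter->WriteTable(*table));
  ARROW_RETURN_NOT_OK(counter->Close());
  const int64_t size = mock->GetExtentBytesWritten();

  // Second pass: writes land in a buffer of exactly the right size,
  // so no growth/copies happen while serializing.
  ARROW_ASSIGN_OR_RAISE(auto buffer, arrow::AllocateBuffer(size));
  std::shared_ptr<arrow::Buffer> out = std::move(buffer);
  arrow::io::FixedSizeBufferWriter sink(out);
  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::MakeStreamWriter(&sink, table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(writer->Close());
  return out;
}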

Though with "standard" allocators there is a risk of running into kernel
page faults (KiPageFault on Windows) when going for buffers over 1 MB.
This can be especially painful in a multithreaded environment.

A custom OutputStream with a configurable buffering parameter might help
to overcome that problem without dealing too much with the allocators,
e.g. along the lines of the sketch below.
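
Something like this rough sketch, say (BufferedOutputStream lives in
arrow/io/buffered.h; the 1 MB default and the WriteBuffered helper are
placeholders of mine, not a tested recommendation):

#include <memory>
#include <string>

#include "arrow/api.h"
#include "arrow/io/buffered.h"
#include "arrow/io/file.h"
#include "arrow/ipc/api.h"

// Coalesce the writer's many small writes into fixed-size chunks
// before they reach the underlying sink (and hence the allocator).
arrow::Status WriteBuffered(const std::shared_ptr<arrow::Table>& table,
                            const std::string& path,
                            int64_t buffer_size = 1 << 20) {
  ARROW_ASSIGN_OR_RAISE(auto raw, arrow::io::FileOutputStream::Open(path));
  ARROW_ASSIGN_OR_RAISE(
      auto sink, arrow::io::BufferedOutputStream::Create(
                     buffer_size, arrow::default_memory_pool(), raw));
  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::MakeStreamWriter(sink, table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(writer->Close());
  // Close() flushes any buffered bytes and closes the raw stream.
  return sink->Close();
}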
Curious to hear community thoughts on this.

Cheers,
Gosh

On Fri., 11 Jun. 2021, 00:45 Wes McKinney, <wesmck...@gmail.com> wrote:

> From this, it seems like seeding the RecordBatchStreamWriter's output
> stream with a much larger preallocated buffer would improve
> performance (depends on the allocator used of course).
>
> On Thu, Jun 10, 2021 at 5:40 PM Weston Pace <weston.p...@gmail.com> wrote:
> >
> > Just for some reference, here are times from my system. I created a
> > quick test to dump a ~1.7 GB table to buffer(s).
> >
> > Going to many buffers (just collecting the buffers): ~11,000 ns
> > Going to one preallocated buffer: ~160,000,000 ns
> > Going to one dynamically allocated buffer (using a growth factor of 2x):
> > ~2,000,000,000 ns
> >
> > On Thu, Jun 10, 2021 at 11:46 AM Wes McKinney <wesmck...@gmail.com>
> wrote:
> > >
> > > To be clear, we would like to help make this faster. I don't recall
> > > much effort being invested in optimizing this code path in the last
> > > couple of years, so there may be some low hanging fruit to improve the
> > > performance. Changing the in-memory data layout (the chunking) is one
> > > of the most likely things to help.
> > >
> > > On Thu, Jun 10, 2021 at 2:14 PM Gosh Arzumanyan <gosh...@gmail.com>
> wrote:
> > > >
> > > > Hi Jayjeet,
> > > >
> > > > I wonder if you really need to serialize the whole table into a
> > > > single buffer: you will end up with twice the memory usage, while you
> > > > could be sending chunks as they are generated by the
> > > > RecordBatchStreamWriter. Also, is the buffer resized beforehand? I'd
> > > > suspect there might be reallocations happening under the hood.
> > > >
> > > >
> > > > Cheers,
> > > > Gosh
> > > >
> > > > On Thu., 10 Jun. 2021, 21:01 Wes McKinney, <wesmck...@gmail.com>
> wrote:
> > > >
> > > > > hi Jayjeet — have you run a profiler to see where those 1000 ms are
> > > > > being spent? How many arrays (the sum of the number of chunks across
> > > > > all columns) are there in total? I would guess that the problem is
> > > > > all the little Buffer memcopies. I don't think that the C Interface
> > > > > is going to help you.
> > > > >
> > > > > - Wes
> > > > >
> > > > > On Thu, Jun 10, 2021 at 1:48 PM Jayjeet Chakraborty
> > > > > <jayjeetchakrabort...@gmail.com> wrote:
> > > > > >
> > > > > > Hello Arrow Community,
> > > > > >
> > > > > > I am a student working on a project where I need to serialize an
> > > > > > in-memory Arrow Table of around 700 MB to a uint8_t* buffer. I am
> > > > > > currently using the arrow::ipc::RecordBatchStreamWriter API to
> > > > > > serialize the table to an arrow::Buffer, but it takes nearly
> > > > > > 1000 ms to serialize the whole table, and that is harming the
> > > > > > performance of my performance-critical application. I basically
> > > > > > want to get hold of the underlying memory of the table as bytes and
> > > > > > send it over the network. How do you suggest I tackle this problem?
> > > > > > I was thinking of using the C Data Interface for this: I would
> > > > > > convert my arrow::Table to ArrowArray and ArrowSchema and serialize
> > > > > > the structs to send them over the network, but it seems like
> > > > > > serializing structs is another complex problem of its own. It would
> > > > > > be great to have some suggestions on this. Thanks a lot.
> > > > > >
> > > > > > Best,
> > > > > > Jayjeet
> > > > > >
> > > > >
>
