> I'd suggest a new write pattern. Write the columns page at a time to
> separate files then use a second process to concatenate the columns and
> append the footer. Odds are you would do better than os swapping and take
> memory requirements down to page size times field count.

This is exactly what a student of ours implemented quite successfully:
writing to one file per column (non-parquet, binary, memory-mapped). Once
enough data has accumulated in those "cache/buffer files", it is flushed to a
parquet row group.
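
For context, the core of that approach looks roughly like this (an untested
C++ sketch; ColumnSpillBuffer and the flush policy are made up for
illustration, arrow::io::MemoryMappedFile is a real Arrow class but not
necessarily what the student's code uses, and error handling is omitted):

#include <arrow/io/file.h>

#include <cstdint>
#include <memory>
#include <string>

// One append-only, memory-mapped scratch file per column. The OS keeps only
// the recently touched pages resident, so RAM usage stays roughly
// page size x column count instead of row-group size x column count.
class ColumnSpillBuffer {
 public:
  ColumnSpillBuffer(const std::string& path, int64_t capacity_bytes)
      : file_(*arrow::io::MemoryMappedFile::Create(path, capacity_bytes)) {}

  // Append raw column values (plain or already encoded) to the scratch file.
  void Append(const void* data, int64_t nbytes) {
    (void)file_->Write(data, nbytes);
    nbytes_ += nbytes;
  }

  int64_t size() const { return nbytes_; }
  std::shared_ptr<arrow::io::MemoryMappedFile> file() const { return file_; }

 private:
  std::shared_ptr<arrow::io::MemoryMappedFile> file_;
  int64_t nbytes_ = 0;
};

// The writer keeps one ColumnSpillBuffer per column; once the buffers together
// reach the intended row group size, each file is read back sequentially and
// handed to the corresponding parquet column writer.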

My question was about integrating these ideas into the arrow parquet writer:
does it make sense to integrate them, or is it better to keep that
functionality outside of arrow/parquet? Having it inside would have the
benefit of reduced scratch space thanks to encoding/compression, and thus less
overhead in the final copy phase (less data to copy, and the data already
encoded/compressed). On the other hand, one memory-mapped file per column is
not something that seems to fit well with the current design of arrow.
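
To put rough numbers on the copy-phase benefit mentioned above (purely
illustrative figures, not measurements):

    100 columns x 64 MB of raw values per row group     = 6.4 GB to move
    with ~4:1 encoding+compression inside the writer    ~ 1.6 GB to move

i.e. the concatenation step would shuffle around only a quarter of the data if
the per-column files already contained encoded/compressed pages.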

Thanks for the feedback,
Roman

On Sun, Jul 12, 2020 at 03:05, Micah Kornfield <emkornfi...@gmail.com> wrote:

> This is an interesting idea. For S3 multipart uploads one might run into
> limitations pretty quickly (only 10k parts appear to be supported, and all
> but the last are expected to be at least 5 MB, if I read their docs
> correctly [1]).
>
> [1] https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
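
A quick back-of-the-envelope check of those limits (my arithmetic,
illustrative only):

    10,000 parts x 5 MB minimum part size  ~  50 GB

so one multipart part per column chunk only works while the file has at most
10,000 column chunks and every chunk except the last reaches 5 MB; bigger
files would need correspondingly bigger parts.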
>
>
> On Saturday, July 11, 2020, Jacques Nadeau <jacq...@apache.org> wrote:
>
> > I'd suggest a new write pattern. Write the columns page at a time to
> > separate files then use a second process to concatenate the columns and
> > append the footer. Odds are you would do better than os swapping and take
> > memory requirements down to page size times field count.
> >
> > In S3 I believe you could do this via a multipart upload and entirely
> > skip the second step. I don't know of any implementations that actually
> > do this yet.
> >
> > On Thu, Jul 9, 2020, 11:58 PM Roman Karlstetter <
> > roman.karlstet...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I wasn't aware of the fact that jemalloc mmaps automatically for larger
> >> allocations, and I haven't tested this yet.
> >>
> >> The approach could be different in that we would know which parts of the
> >> buffers are going to be used next (the buffers are append-only) and which
> >> parts won't be needed until actually flushing the row group (and when
> >> flushing, we also know the order). But I'm not sure whether that knowledge
> >> helps a lot in a) saving memory compared to a generic allocator or b)
> >> improving performance. In addition, communicating this knowledge to the
> >> implementation will also be tricky in the general case, I guess.
> >>
> >> Regarding setting the allocator to another memory pool: I was unsure
> >> whether that memory pool is also used for other allocations where the
> >> default memory pool would be more appropriate. If not, then setting the
> >> memory pool in the writer properties should actually work well.
> >>
> >> Maybe I should just play a bit with the different memory pool options
> >> and see how they behave. It makes more sense to discuss further ideas
> >> once I have some performance numbers.
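
If it is useful to anyone following along, swapping pools for such an
experiment is only a few lines (a sketch; arrow::LoggingMemoryPool and
WriterProperties::Builder::memory_pool() are existing APIs, but whether all of
the writer's buffered-row-group allocations actually go through that pool is
exactly the open question above):

#include <arrow/memory_pool.h>
#include <parquet/properties.h>

#include <memory>

int main() {
  // Candidate pools to compare; both are part of Arrow's public API.
  arrow::MemoryPool* base = arrow::default_memory_pool();
  // arrow::MemoryPool* base = arrow::system_memory_pool();

  // Wrap the pool so each allocation is logged while profiling.
  arrow::LoggingMemoryPool logging_pool(base);

  std::shared_ptr<parquet::WriterProperties> props =
      parquet::WriterProperties::Builder()
          .memory_pool(&logging_pool)
          ->build();

  // 'props' would then be passed to parquet::ParquetFileWriter::Open(...).
  return 0;
}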
> >>
> >> Thanks,
> >> Roman
> >>
> >>
> >> On Fri, Jul 10, 2020 at 06:47, Micah Kornfield <emkornfi...@gmail.com>
> >> wrote:
> >>
> >> > +parquet-dev as this seems more concerned with the non-arrow pieces of
> >> > parquet
> >> >
> >> > Hi Roman,
> >> > Answers inline.
> >> >
> >> > > One way to solve that problem would be to use memory mapped files
> >> > > instead of plain memory buffers. That way, the amount of required
> >> > > memory can be limited to the number of columns times the OS page
> >> > > size, which would be independent of the row group size. Consequently,
> >> > > large row group sizes pose no problem with respect to RAM consumption.
> >> >
> >> > I was under the impression that modern allocators (e.g. jemalloc)
> >> > already use mmap for large allocations. How would this approach be
> >> > different from the way allocators use it? Have you prototyped this
> >> > approach to see if it allows for better scalability?
> >> >
> >> >
> >> > > After a quick look at how the buffers are managed inside arrow
> >> > > (allocated from a default memory pool), I have the impression that an
> >> > > implementation of this idea could be a rather huge change. I still
> >> > > wanted to know whether that is something you could see being
> >> > > integrated or whether that is out of scope of arrow.
> >> >
> >> >
> >> > A huge change probably isn't a great idea unless we've validated the
> >> > approach along with alternatives. Is there currently code that doesn't
> >> > make use of the MemoryPool [1] provided by WriterProperties? If so, we
> >> > should probably fix it. Otherwise, is there a reason that you can't
> >> > substitute a customized memory pool on WriterProperties?
> >> >
> >> > Thanks,
> >> > Micah
> >> >
> >> > [1]
> >> > https://github.com/apache/arrow/blob/5602c459eb8773b6be8059b1b118175e9f16b7a3/cpp/src/parquet/properties.h#L447
> >> >
> >> > On Thu, Jul 9, 2020 at 8:35 AM Roman Karlstetter <
> >> > roman.karlstet...@gmail.com> wrote:
> >> >
> >> > > Hi everyone,
> >> > >
> >> > > For some time now, parquet::ParquetFileWriter has had the option to
> >> > > create buffered row groups with AppendBufferedRowGroup(), which
> >> > > basically lets you write to the columns in any order you like (in
> >> > > contrast to the previous restriction of writing one column after the
> >> > > other). This is nice since it saves the caller from having to build
> >> > > an in-memory columnar representation of its data first.
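
For readers who have not used it yet, a bare-bones sketch of the buffered mode
(untested; two required INT64 columns written in interleaved order, file path
and column names arbitrary, error handling omitted):

#include <arrow/io/file.h>
#include <parquet/column_writer.h>
#include <parquet/file_writer.h>
#include <parquet/schema.h>

#include <memory>

int main() {
  using parquet::schema::GroupNode;
  using parquet::schema::PrimitiveNode;

  // Schema with two required INT64 columns.
  parquet::schema::NodeVector fields;
  fields.push_back(PrimitiveNode::Make("a", parquet::Repetition::REQUIRED,
                                       parquet::Type::INT64));
  fields.push_back(PrimitiveNode::Make("b", parquet::Repetition::REQUIRED,
                                       parquet::Type::INT64));
  auto schema = std::static_pointer_cast<GroupNode>(
      GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));

  auto sink = *arrow::io::FileOutputStream::Open("/tmp/buffered.parquet");
  auto writer = parquet::ParquetFileWriter::Open(sink, schema);

  // Buffered mode: all column writers of the row group are available at once,
  // so values can be appended to the columns in any order.
  parquet::RowGroupWriter* rg = writer->AppendBufferedRowGroup();
  auto* col_a = static_cast<parquet::Int64Writer*>(rg->column(0));
  auto* col_b = static_cast<parquet::Int64Writer*>(rg->column(1));

  for (int64_t i = 0; i < 100; ++i) {
    col_a->WriteBatch(1, nullptr, nullptr, &i);
    int64_t v = i * 2;
    col_b->WriteBatch(1, nullptr, nullptr, &v);
  }

  rg->Close();      // flushes the buffered row group to the sink
  writer->Close();
  return 0;
}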
> >> > >
> >> > > However, when the data size is huge compared to the available system
> >> > > memory (due to a wide schema or a large row group size), this is
> >> > > problematic, as the internally allocated buffers can take up a large
> >> > > portion of the RAM of the machine the conversion is running on.
> >> > >
> >> > > One way to solve that problem would be to use memory mapped files
> >> > > instead of plain memory buffers. That way, the amount of required
> >> > > memory can be limited to the number of columns times the OS page
> >> > > size, which would be independent of the row group size. Consequently,
> >> > > large row group sizes pose no problem with respect to RAM consumption.
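
To illustrate the intended bound with made-up numbers (only the tail page of
each append-only column buffer needs to stay resident; the OS can write back
and evict everything else):

    1,000 columns x 4 KiB OS page            ~ 4 MiB of hot pages
    fully buffered 1 GiB row group in RAM    ~ 1 GiB, no matter which
                                               allocator provided it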
> >> > >
> >> > > I wonder what you generally think about the idea of adding an
> >> > > AppendFileBufferedRowGroup() (or similarly named) method which gives
> >> > > the user the option to have the internal buffers backed by memory
> >> > > mapped files.
> >> > >
> >> > > After a quick look at how the buffers are managed inside arrow
> >> > > (allocated from a default memory pool), I have the impression that an
> >> > > implementation of this idea could be a rather huge change. I still
> >> > > wanted to know whether that is something you could see being
> >> > > integrated or whether that is out of scope of arrow.
> >> > >
> >> > > Thanks in advance and kind regards,
> >> > > Roman
> >> > >
> >> >
> >>
> >
>
