Just a few additional thoughts:

> at least as measured by
> the memory pool's max_memory() method.

The Parquet reader does a fair amount of allocation on the global
system allocator (i.e. not through a memory pool).  Typically this
should be small in comparison with the data buffers themselves (which
are allocated from memory pools), but it might be worth also
inspecting the RSS usage of your benchmarks.
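
For example (untested sketch; the high-water mark comes from Arrow's
MemoryPool API, and on Linux ru_maxrss is reported in kilobytes),
comparing the two numbers at the end of a benchmark run would show
how much is going through the pool vs. the system allocator:

    // Compare Arrow's memory-pool high-water mark against the process's
    // peak RSS. Untested sketch; ru_maxrss is kilobytes on Linux and
    // bytes on macOS.
    #include <sys/resource.h>

    #include <cstdint>
    #include <iostream>

    #include <arrow/memory_pool.h>

    void ReportPeakMemory() {
      // Peak bytes ever allocated through Arrow's default memory pool.
      int64_t pool_peak = arrow::default_memory_pool()->max_memory();

      // Peak resident set size of the whole process, which also covers
      // allocations made directly on the system allocator.
      struct rusage usage;
      getrusage(RUSAGE_SELF, &usage);

      std::cout << "pool max_memory:  " << pool_peak << " bytes\n"
                << "process peak RSS: " << usage.ru_maxrss << " kB (Linux)\n";
    }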

> (1) Turn off prebuffering, (2) Read data in batches, and
> (3) turn on buffered_stream.

This might go without saying, but if we're going to include this in
our docs, we should recommend that users also measure the performance
impact of these changes.
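
In case it helps with the docs, here's roughly what I have in mind
for those three knobs (untested sketch against the C++ API around
Arrow 9; the exact GetRecordBatchReader overload differs between
versions, and the path and batch size are placeholders):

    #include <memory>
    #include <numeric>
    #include <string>
    #include <vector>

    #include <arrow/io/file.h>
    #include <arrow/record_batch.h>
    #include <parquet/arrow/reader.h>
    #include <parquet/exception.h>
    #include <parquet/properties.h>

    void ReadWithLowMemorySettings(const std::string& path) {
      // (3) Decode through a buffered stream instead of materializing
      //     whole column chunks at once.
      parquet::ReaderProperties reader_props =
          parquet::default_reader_properties();
      reader_props.enable_buffered_stream();

      // (1) Turn off pre-buffering and (2) read in smaller batches.
      parquet::ArrowReaderProperties arrow_props =
          parquet::default_arrow_reader_properties();
      arrow_props.set_pre_buffer(false);
      arrow_props.set_batch_size(64 * 1024);  // rows per record batch

      PARQUET_ASSIGN_OR_THROW(auto infile,
                              arrow::io::ReadableFile::Open(path));

      parquet::arrow::FileReaderBuilder builder;
      PARQUET_THROW_NOT_OK(builder.Open(infile, reader_props));
      std::unique_ptr<parquet::arrow::FileReader> reader;
      PARQUET_THROW_NOT_OK(builder.properties(arrow_props)->Build(&reader));

      // Stream the file one record batch at a time over all row groups.
      std::vector<int> row_groups(reader->num_row_groups());
      std::iota(row_groups.begin(), row_groups.end(), 0);
      std::unique_ptr<arrow::RecordBatchReader> batch_reader;
      PARQUET_THROW_NOT_OK(
          reader->GetRecordBatchReader(row_groups, &batch_reader));

      std::shared_ptr<arrow::RecordBatch> batch;
      while (true) {
        PARQUET_THROW_NOT_OK(batch_reader->ReadNext(&batch));
        if (batch == nullptr) break;
        // ... process the batch ...
      }
    }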

> If there's no further input

If the goal is to reduce memory, then users might also want to think
about dictionary encoding for string/binary columns.  I'm not entirely
sure how the properties work, but I think you can force Arrow to read
certain columns as dictionary-encoded (I could be very wrong here).
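
If I'm remembering right, the knob for that is
ArrowReaderProperties::set_read_dictionary, something along these
lines (untested sketch; the column indices are placeholders):

    // Ask the Arrow layer to keep selected string/binary columns
    // dictionary-encoded instead of decoding them to dense arrays.
    // Untested sketch; the column indices here are placeholders.
    parquet::ArrowReaderProperties arrow_props =
        parquet::default_arrow_reader_properties();
    arrow_props.set_read_dictionary(/*column_index=*/0, /*read_dict=*/true);
    arrow_props.set_read_dictionary(/*column_index=*/3, /*read_dict=*/true);
    // Then pass arrow_props to FileReaderBuilder::properties() as above.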

On Thu, Aug 11, 2022 at 12:26 PM Will Jones <will.jones...@gmail.com> wrote:
>
> Hi all,
>
> I found my issue: I was not actually passing down the
> ArrowReaderProperties. I can now see that lowering batch_size meaningfully
> reduces memory usage [1]. I still see more memory used when reading files
> with larger row groups, keeping the batch size constant.
>
> Overall I found that users who want to keep memory usage down when reading
> Parquet should: (1) Turn off prebuffering, (2) Read data in batches, and
> (3) turn on buffered_stream.
>
> If there's no further input, I may add these suggestions to our docs.
>
> [1]
> https://github.com/wjones127/arrow-parquet-memory-bench/blob/7e0d740a09c8042da647a0de1f285b6bb8a7f4db/readme_files/figure-gfm/group-size-1.png
>
> On Tue, Aug 9, 2022 at 4:11 PM Will Jones <will.jones...@gmail.com> wrote:
>
> > I did some experiments to try to understand what controls a user has to
> > constrain how much memory our Parquet readers use, at least as measured by
> > the memory pool's max_memory() method.
> >
> > I was surprised to find that parquet::ArrowReaderProperties.batch_size
> > didn't have much of an effect at all on the peak memory usage [1]. The code
> > I ran was [2].
> >
> > Two questions:
> >
> > 1. Is this expected? Or does it sound like I did something wrong?
> > 2. Is there a way we could make it so that setting a smaller batch size
> > reduced the memory required to read into a record batch stream?
> >
> > I created a repo for these tests at [3].
> >
> > [1]
> > https://github.com/wjones127/arrow-parquet-memory-bench/blob/5434f9f642c452470aa18ca872e9acd0d7462a1a/readme_files/figure-gfm/group-size-1.png
> > [2]
> > https://github.com/wjones127/arrow-parquet-memory-bench/blob/5434f9f642c452470aa18ca872e9acd0d7462a1a/src/main.cc#L51-L66
> > [3] https://github.com/wjones127/arrow-parquet-memory-bench
> >
