Just a few additional thoughts:

> at least as measured by
> the memory pool's max_memory() method.

The parquet reader does a fair amount of allocation on the global system
allocator (i.e. not using a memory pool). Typically this should be small
in comparison with the data buffers themselves (which will be allocated on
memory pools), but it might be worth also inspecting the RSS usage of your
benchmarks (rough sketch of what I mean below).

> (1) Turn off prebuffering, (2) Read data in batches, and
> (3) turn on buffered_stream.

This might go without saying, but if we're going to include these
suggestions in our docs we should recommend that users also measure the
performance impact of the changes (a sketch of setting the three knobs is
below as well).

> If there's no further input

If the goal is to reduce memory then users might also want to think about
dictionary encoding for string/binary columns. I'm not entirely sure how
the properties work, but I think you can force Arrow to read certain
columns as dictionary encoded (I could be very wrong here; last sketch
below).
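By inspecting RSS I mean something like the following. This is an untested
sketch, not a benchmark: it reads a file ("data.parquet" is a placeholder
path) and prints both the pool's high-water mark and the process's peak
resident set size via getrusage, which also covers allocations that never
went through the pool. Note ru_maxrss is in kilobytes on Linux and bytes on
macOS.

// peak_memory.cc -- untested sketch: read a file, then print both the
// memory pool's high-water mark and the process's peak RSS.
#include <sys/resource.h>

#include <iostream>
#include <memory>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>

int main() {
  auto infile = arrow::io::ReadableFile::Open("data.parquet").ValueOrDie();

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));

  std::shared_ptr<arrow::Table> table;
  PARQUET_THROW_NOT_OK(reader->ReadTable(&table));

  // Allocations that went through the default memory pool.
  std::cout << "pool max_memory: "
            << arrow::default_memory_pool()->max_memory() << " bytes\n";

  // Peak resident set size of the whole process, which also includes
  // whatever was allocated on the system allocator.
  struct rusage usage;
  getrusage(RUSAGE_SELF, &usage);
  std::cout << "peak RSS (ru_maxrss): " << usage.ru_maxrss << "\n";
  return 0;
}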
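For the docs, I think the three knobs Will lists map onto the reader
properties roughly like this. Again an untested sketch; the batch size and
buffer size values are just placeholders, and the timing/measurement of each
configuration is left out.

// low_memory_read.cc -- untested sketch of the three knobs: no
// pre-buffering, a buffered input stream, and reading in batches.
#include <iostream>
#include <memory>
#include <numeric>
#include <vector>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>
#include <parquet/properties.h>

int main() {
  auto pool = arrow::default_memory_pool();

  // (3) Buffered stream: read column chunks through a small fixed-size
  // buffer instead of materializing whole chunks at once.
  parquet::ReaderProperties reader_props(pool);
  reader_props.enable_buffered_stream();
  reader_props.set_buffer_size(1 << 20);  // 1 MiB, placeholder value

  // (1) Turn off pre-buffering; (2) read in smaller batches.
  parquet::ArrowReaderProperties arrow_props;
  arrow_props.set_pre_buffer(false);
  arrow_props.set_batch_size(64 * 1024);  // rows per batch, placeholder

  auto infile = arrow::io::ReadableFile::Open("data.parquet").ValueOrDie();

  parquet::arrow::FileReaderBuilder builder;
  PARQUET_THROW_NOT_OK(builder.Open(infile, reader_props));
  builder.memory_pool(pool);
  builder.properties(arrow_props);

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(builder.Build(&reader));

  // Stream batch by batch rather than reading the whole table.
  std::vector<int> row_groups(reader->num_row_groups());
  std::iota(row_groups.begin(), row_groups.end(), 0);
  std::unique_ptr<arrow::RecordBatchReader> batch_reader;
  PARQUET_THROW_NOT_OK(
      reader->GetRecordBatchReader(row_groups, &batch_reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    PARQUET_THROW_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;
    std::cout << "read " << batch->num_rows() << " rows\n";
  }
  return 0;
}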
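And the dictionary knob I was thinking of is (I believe)
ArrowReaderProperties::set_read_dictionary. Untested sketch; I'm assuming
the index it takes is the Parquet column index and that column 0 is the
string/binary column of interest, so please double-check before relying on
it.

// dictionary_read.cc -- untested sketch: ask the Arrow reader to return a
// column as dictionary-encoded instead of plain strings.
#include <iostream>
#include <memory>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>

int main() {
  // Assuming column 0 is the string/binary column we care about.
  parquet::ArrowReaderProperties arrow_props;
  arrow_props.set_read_dictionary(/*column_index=*/0, true);

  auto infile = arrow::io::ReadableFile::Open("data.parquet").ValueOrDie();

  parquet::arrow::FileReaderBuilder builder;
  PARQUET_THROW_NOT_OK(builder.Open(infile));
  builder.memory_pool(arrow::default_memory_pool());
  builder.properties(arrow_props);

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(builder.Build(&reader));

  std::shared_ptr<arrow::Table> table;
  PARQUET_THROW_NOT_OK(reader->ReadTable(&table));

  // If this took effect, the field type should print as something like
  // dictionary<values=string, indices=int32>.
  std::cout << table->schema()->field(0)->type()->ToString() << std::endl;
  return 0;
}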
On Thu, Aug 11, 2022 at 12:26 PM Will Jones <will.jones...@gmail.com> wrote:
>
> Hi all,
>
> I found my issue: I was not actually passing down the
> ArrowReaderProperties. I can now see that lowering batch_size meaningfully
> reduces memory usage [1]. I still see more memory used when reading files
> with larger row groups, keeping the batch size constant.
>
> Overall I found that users who want to keep memory usage down when reading
> Parquet should: (1) Turn off prebuffering, (2) Read data in batches, and
> (3) turn on buffered_stream.
>
> If there's no further input, I may add these suggestions to our docs.
>
> [1]
> https://github.com/wjones127/arrow-parquet-memory-bench/blob/7e0d740a09c8042da647a0de1f285b6bb8a7f4db/readme_files/figure-gfm/group-size-1.png
>
> On Tue, Aug 9, 2022 at 4:11 PM Will Jones <will.jones...@gmail.com> wrote:
>
> > I did some experiments to try to understand what controls a user has to
> > constrain how much memory our Parquet readers use, at least as measured by
> > the memory pool's max_memory() method.
> >
> > I was surprised to find that parquet::ArrowReaderProperties.batch_size
> > didn't have much of an effect at all on the peak memory usage [1]. The code
> > I ran was [2].
> >
> > Two questions:
> >
> > 1. Is this expected? Or does it sound like I did something wrong?
> > 2. Is there a way we could make it so that setting a smaller batch size
> > reduced the memory required to read into a record batch stream?
> >
> > I created a repo for these tests at [3].
> >
> > [1]
> > https://github.com/wjones127/arrow-parquet-memory-bench/blob/5434f9f642c452470aa18ca872e9acd0d7462a1a/readme_files/figure-gfm/group-size-1.png
> > [2]
> > https://github.com/wjones127/arrow-parquet-memory-bench/blob/5434f9f642c452470aa18ca872e9acd0d7462a1a/src/main.cc#L51-L66
> > [3] https://github.com/wjones127/arrow-parquet-memory-bench
> >