I did some experiments to try to understand what controls a user has to constrain how much memory our Parquet readers use, at least as measured by the memory pool's max_memory() method.
I was surprised to find that parquet::ArrowReaderProperties::batch_size had little effect on peak memory usage [1]. The code I ran is at [2].

Two questions:

1. Is this expected, or does it sound like I did something wrong?
2. Is there a way we could make it so that setting a smaller batch size reduces the memory required to read into a record batch stream?

I created a repo for these tests at [3]. A rough sketch of the kind of measurement loop I mean is included below the links.

[1] https://github.com/wjones127/arrow-parquet-memory-bench/blob/5434f9f642c452470aa18ca872e9acd0d7462a1a/readme_files/figure-gfm/group-size-1.png
[2] https://github.com/wjones127/arrow-parquet-memory-bench/blob/5434f9f642c452470aa18ca872e9acd0d7462a1a/src/main.cc#L51-L66
[3] https://github.com/wjones127/arrow-parquet-memory-bench
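P.S. For clarity, here is a minimal sketch of the kind of measurement I'm describing (this is not the exact code in [2]; the file name and batch sizes are placeholders, and the exact GetRecordBatchReader overload varies between Arrow versions):

    #include <cstdint>
    #include <iostream>
    #include <memory>
    #include <numeric>
    #include <string>
    #include <vector>

    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <parquet/arrow/reader.h>
    #include <parquet/properties.h>

    // Read a Parquet file as a record batch stream with the given batch size,
    // then report the pool's peak allocation.
    arrow::Status ReadWithBatchSize(const std::string& path, int64_t batch_size) {
      arrow::MemoryPool* pool = arrow::default_memory_pool();

      ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));

      // Set the batch size used when producing record batches.
      parquet::ArrowReaderProperties arrow_props;
      arrow_props.set_batch_size(batch_size);

      parquet::arrow::FileReaderBuilder builder;
      ARROW_RETURN_NOT_OK(builder.Open(infile));
      builder.memory_pool(pool);
      builder.properties(arrow_props);

      std::unique_ptr<parquet::arrow::FileReader> reader;
      ARROW_RETURN_NOT_OK(builder.Build(&reader));

      // Stream all row groups. (The overload taking a unique_ptr out-parameter
      // is what I'm assuming here; newer/older versions may differ.)
      std::vector<int> row_groups(reader->num_row_groups());
      std::iota(row_groups.begin(), row_groups.end(), 0);
      std::unique_ptr<arrow::RecordBatchReader> batch_reader;
      ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(row_groups, &batch_reader));

      std::shared_ptr<arrow::RecordBatch> batch;
      int64_t num_rows = 0;
      while (true) {
        ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
        if (batch == nullptr) break;  // end of stream
        num_rows += batch->num_rows();
      }

      std::cout << "batch_size=" << batch_size << " rows=" << num_rows
                << " max_memory=" << pool->max_memory() << std::endl;
      return arrow::Status::OK();
    }

    int main() {
      // "example.parquet" is a placeholder file name.
      for (int64_t batch_size : {1024, 16 * 1024, 64 * 1024}) {
        arrow::Status status = ReadWithBatchSize("example.parquet", batch_size);
        if (!status.ok()) {
          std::cerr << status.ToString() << std::endl;
          return 1;
        }
      }
      return 0;
    }

My expectation was that smaller batch sizes would show up as a lower max_memory() value in a loop like this, which is what the plot in [1] is checking.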