I did some experiments to try to understand what controls a user has to
constrain how much memory our Parquet readers use, at least as measured by
the memory pool's max_memory() method.
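
(For concreteness, the number I'm tracking is the pool's allocation
high-water mark, roughly like this; the helper name is just illustrative:)

    #include <cstdint>
    #include <arrow/memory_pool.h>

    // Cumulative high-water mark of bytes allocated through the default
    // memory pool (-1 if the pool implementation doesn't track it).
    int64_t PeakDefaultPoolBytes() {
      return arrow::default_memory_pool()->max_memory();
    }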

I was surprised to find that parquet::ArrowReaderProperties::batch_size
had almost no effect on the peak memory usage [1]. The code I ran is at
[2].
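
For context, that code does roughly the following (a sketch rather than
the exact benchmark; names here are illustrative, and the exact
GetRecordBatchReader signature has shifted a bit across Arrow versions):

    #include <iostream>
    #include <memory>
    #include <string>

    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <parquet/arrow/reader.h>
    #include <parquet/properties.h>

    // Stream a whole Parquet file as record batches of `batch_size` rows,
    // then report the default pool's peak allocation.
    arrow::Status ReadWithBatchSize(const std::string& path, int64_t batch_size) {
      ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));

      // Arrow-level reader properties; batch_size is the number of rows per
      // emitted RecordBatch.
      parquet::ArrowReaderProperties arrow_props;
      arrow_props.set_batch_size(batch_size);

      parquet::arrow::FileReaderBuilder builder;
      ARROW_RETURN_NOT_OK(builder.Open(input));
      builder.properties(arrow_props);
      std::unique_ptr<parquet::arrow::FileReader> reader;
      ARROW_RETURN_NOT_OK(builder.Build(&reader));

      // Read all row groups and columns as a record batch stream and drain it.
      std::shared_ptr<arrow::RecordBatchReader> batch_reader;
      ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(&batch_reader));
      std::shared_ptr<arrow::RecordBatch> batch;
      while (true) {
        ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
        if (batch == nullptr) break;  // end of stream
      }

      std::cout << "peak pool bytes: "
                << arrow::default_memory_pool()->max_memory() << std::endl;
      return arrow::Status::OK();
    }

My expectation was that fewer rows per emitted batch would show up as a
lower peak, which is not what the plot in [1] shows.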

Two questions:

1. Is this expected? Or does it sound like I did something wrong?
2. Is there a way we could make it so that setting a smaller batch size
reduces the memory required to read into a record batch stream?

I created a repo for these tests at [3].

[1]
https://github.com/wjones127/arrow-parquet-memory-bench/blob/5434f9f642c452470aa18ca872e9acd0d7462a1a/readme_files/figure-gfm/group-size-1.png
[2]
https://github.com/wjones127/arrow-parquet-memory-bench/blob/5434f9f642c452470aa18ca872e9acd0d7462a1a/src/main.cc#L51-L66
[3] https://github.com/wjones127/arrow-parquet-memory-bench
