Hi all,

I had a question about memory usage in ParquetFileReader, particularly in #readNextRowGroup <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L937> and #readNextFilteredRowGroup <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1076>. From what I can tell, these methods enumerate all column chunks in the row group and then, for each chunk, fully read all of its pages.
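For context, this is roughly how I'm driving the reader (heavily simplified; the file path, Configuration, and process() method are just placeholders for my actual setup):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class RowGroupScan {
  public static void main(String[] args) throws Exception {
    // "data.parquet" is a placeholder path
    HadoopInputFile file =
        HadoopInputFile.fromPath(new Path("data.parquet"), new Configuration());
    try (ParquetFileReader reader = ParquetFileReader.open(file)) {
      PageReadStore rowGroup;
      while ((rowGroup = reader.readNextRowGroup()) != null) {
        // By the time readNextRowGroup returns, the bytes for every column
        // chunk in this row group have (as far as I can tell) already been
        // read, so the whole row group is resident in memory at once.
        process(rowGroup);
      }
    }
  }

  private static void process(PageReadStore rowGroup) {
    // stand-in for the actual record assembly (ColumnReadStoreImpl etc.)
  }
}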
I've been encountering memory issues when performing heavy reads of Parquet data, particularly in use cases that require colocating multiple Parquet files on a single worker. In those cases a single worker may be reading dozens or hundreds of Parquet files, and materializing full row groups is causing OOMs, even with a tweaked row group size.

I'm wondering whether there's any way to avoid materializing the entire row group at once and instead materialize pages on an as-needed basis (along with the dictionary page, etc., when we start on a new chunk). Looking through the ParquetFileReader code, would it make sense to re-implement pagesInChunk <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1552> as an Iterator<DataPage> rather than a List<DataPage>, and to modify ColumnChunkPageReader to accept a lazy collection of data pages? I've included a very rough sketch of the iterator shape I have in mind at the end of this mail.

Let me know what you think! It's possible that I'm misunderstanding how readNextRowGroup works -- Parquet internals are a steep learning curve :)

Best,
Claire
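P.S. Purely for illustration, here is the kind of iterator shape I have in mind. readOnePage() is a hypothetical stand-in for the per-page header/body decoding that Chunk.readAllPages currently does in a loop, and this ignores dictionary pages, CRC checks, offset indexes, etc.:

import java.util.Iterator;
import java.util.NoSuchElementException;

import org.apache.parquet.column.page.DataPage;

class LazyPageIterator implements Iterator<DataPage> {
  private final long totalValueCount; // value count declared in the chunk metadata
  private long valuesReadSoFar = 0;

  LazyPageIterator(long totalValueCount) {
    this.totalValueCount = totalValueCount;
  }

  @Override
  public boolean hasNext() {
    // Roughly the stop condition readAllPages uses today when no offset
    // index is present: stop once the chunk's declared value count is consumed.
    return valuesReadSoFar < totalValueCount;
  }

  @Override
  public DataPage next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    // Instead of materializing every page up front, read and decode only
    // the next page header + body here.
    DataPage page = readOnePage();
    valuesReadSoFar += page.getValueCount();
    return page;
  }

  private DataPage readOnePage() {
    // hypothetical: would reuse the existing per-page logic from readAllPages
    throw new UnsupportedOperationException("illustrative stub");
  }
}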