Hi all,

I had a question about memory usage in ParquetFileReader, particularly in
#readNextRowGroup
<https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L937>
/#readNextFilteredRowGroup
<https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1076>.
From what I can tell, these methods enumerate all column chunks in the
row group and then, for each chunk, fully read all of its pages into memory.
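For context, here's roughly how I'm driving the reader (the path and setup
are just placeholders, not our actual code); my understanding is that each
readNextRowGroup() call reads every page of every column chunk in that row
group into memory before it returns:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.column.page.PageReadStore;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class RowGroupScan {
      public static void main(String[] args) throws Exception {
        Path path = new Path(args[0]);
        try (ParquetFileReader reader =
            ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
          PageReadStore rowGroup;
          // As far as I can tell, each readNextRowGroup() call reads all pages
          // of all column chunks in the row group into memory before returning.
          while ((rowGroup = reader.readNextRowGroup()) != null) {
            System.out.println("rows in group: " + rowGroup.getRowCount());
          }
        }
      }
    }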

I've been encountering memory issues when performing heavy reads of Parquet
data, particularly in use cases that require colocating multiple Parquet
files on a single worker. In those cases, a single worker may be reading
dozens or hundreds of Parquet files at once, and materializing a full row
group per reader causes OOMs even after tuning the row group size down.

I'm wondering if there's any way to avoid materializing the entire row
group at once and instead materialize pages on an as-needed basis (along
with the dictionary page etc. when we start on a new chunk). Looking through
the ParquetFileReader code, a possible approach could be to re-implement
pagesInChunk
<https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1552>
to return an Iterator<DataPage> rather than a List<DataPage>, and to modify
ColumnChunkPageReader to read from a lazy collection of data pages. A rough
sketch of what I mean is below.
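To make the idea concrete, here's a very rough sketch of the shape I have in
mind (the class name is made up, and it glosses over the decompression,
offset-index, and checksum handling that the real ColumnChunkPageReader
does), just to illustrate pulling pages from an Iterator on demand instead
of from a pre-built list:

    import java.util.Iterator;
    import org.apache.parquet.column.page.DataPage;
    import org.apache.parquet.column.page.DictionaryPage;
    import org.apache.parquet.column.page.PageReader;

    // Sketch only: a PageReader backed by a lazy Iterator<DataPage> instead
    // of a fully materialized List<DataPage>.
    class LazyColumnChunkPageReader implements PageReader {
      private final DictionaryPage dictionaryPage; // still read eagerly per chunk
      private final Iterator<DataPage> pages;      // produced lazily from the chunk bytes
      private final long totalValueCount;

      LazyColumnChunkPageReader(DictionaryPage dictionaryPage,
                                Iterator<DataPage> pages,
                                long totalValueCount) {
        this.dictionaryPage = dictionaryPage;
        this.pages = pages;
        this.totalValueCount = totalValueCount;
      }

      @Override
      public DictionaryPage readDictionaryPage() {
        return dictionaryPage;
      }

      @Override
      public long getTotalValueCount() {
        return totalValueCount;
      }

      @Override
      public DataPage readPage() {
        // Only materialize the next page when the column reader asks for it.
        return pages.hasNext() ? pages.next() : null;
      }
    }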

Let me know what you think! It's possible that I'm misunderstanding how
readNextRowGroup works -- Parquet internals have a steep learning curve :)

Best,
Claire
