Great, that makes sense, Gabor! Perhaps this could even be implemented via an integer configuration value for how many pages, or page bytes, to buffer at a time, so that users can balance I/O speed against memory usage. I'll try out a few approaches and aim to update this thread when I have something.
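Roughly the shape I have in mind is sketched below; the property name is just a placeholder I made up, not an existing parquet-mr option:

import org.apache.hadoop.conf.Configuration;

public class PageBufferConfigSketch {
  // Hypothetical property name; not an existing parquet-mr option.
  static final String PAGE_BUFFER_BYTES = "parquet.read.page.buffer.bytes";
  static final int DEFAULT_PAGE_BUFFER_BYTES = 8 * 1024 * 1024; // 8 MiB

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Lowering the budget trades sequential-I/O throughput for a smaller
    // heap footprint while pages of a column chunk are being read.
    conf.setInt(PAGE_BUFFER_BYTES, 4 * 1024 * 1024);

    int bufferBytes = conf.getInt(PAGE_BUFFER_BYTES, DEFAULT_PAGE_BUFFER_BYTES);
    System.out.println("Would buffer up to " + bufferBytes + " bytes of pages per read");
  }
}

The reader could then consult that budget to decide how many consecutive pages to pull into memory per I/O call.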
Best,
Claire

On Tue, Mar 5, 2024 at 2:55 AM Gábor Szádovszky <ga...@apache.org> wrote:

> Hi Claire,
>
> I think you read it correctly. Your proposal sounds good to me, but you
> need to make it a separate way of reading instead of rewriting the
> current behavior. The current implementation figures out the consecutive
> parts in the file (multiple pages or even column chunks written after
> each other) and reads them in one attempt. This way the I/O is faster.
> Meanwhile, your concerns are also completely valid, so reading the pages
> lazily as they are needed saves memory. It should be up to the API client
> to choose between the solutions.
> Since we already have the interfaces that we can hide our logic behind
> (PageReadStore/PageReader), probably the best way would be introducing an
> additional configuration that allows lazy reading behind the scenes.
>
> Cheers,
> Gabor
>
> Claire McGinty <claire.d.mcgi...@gmail.com> wrote (on Mon, Mar 4, 2024 at
> 21:04):
>
> > Hi all,
> >
> > I had a question about memory usage in ParquetFileReader, particularly
> > in #readNextRowGroup
> > <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L937>
> > / #readNextFilteredRowGroup
> > <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1076>.
> > From what I can tell, these methods will enumerate all column chunks in
> > the row group, then for each chunk, fully read all pages in the chunk.
> >
> > I've been encountering memory issues performing heavy reads of Parquet
> > data, particularly in use cases that require the colocation of multiple
> > Parquet files on a single worker. In cases like these, a single worker
> > may be reading dozens or hundreds of Parquet files, and trying to
> > materialize row groups is causing OOMs, even with a tweaked row group
> > size.
> >
> > I'm wondering if there's any way to avoid materializing the entire row
> > group at once, and instead materialize pages on an as-needed basis
> > (along with dictionary encoding etc. when we start on a new chunk).
> > Looking through the ParquetFileReader code, a possible solution could
> > be to re-implement pagesInChunk
> > <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1552>
> > as an Iterator<DataPage> rather than a List<DataPage>, and modify
> > ColumnChunkPageReader to support a lazy Collection of data pages?
> >
> > Let me know what you think! It's possible that I'm misunderstanding how
> > readNextRowGroup works -- Parquet internals are a steep learning curve :)
> >
> > Best,
> > Claire
> >
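To make the Iterator<DataPage> idea from my original message (quoted above) a bit more concrete, this is roughly the shape I'm picturing; the class name and its placement are hypothetical, not actual parquet-mr internals:

import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

import org.apache.parquet.column.page.DataPage;

// Instead of materializing a List<DataPage> up front, hold cheap suppliers
// (e.g. wrapping an already-parsed page header plus its file offset) and
// only read/decompress a page when next() is called.
public class LazyPageIterator implements Iterator<DataPage> {
  private final Iterator<Supplier<DataPage>> pendingPages;

  public LazyPageIterator(List<Supplier<DataPage>> pendingPages) {
    this.pendingPages = pendingPages.iterator();
  }

  @Override
  public boolean hasNext() {
    return pendingPages.hasNext();
  }

  @Override
  public DataPage next() {
    // Each supplier performs the I/O + decompression for exactly one page,
    // so only one decoded page per column needs to be held at a time.
    return pendingPages.next().get();
  }
}

hasNext() stays cheap because the suppliers don't touch the decompressor until next() is actually called.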