Sounds good! I created PARQUET-2443 <https://issues.apache.org/jira/browse/PARQUET-2443>.
Best,
Claire

On Tue, Mar 5, 2024 at 8:43 AM Gábor Szádovszky <gabor.szadovs...@gmail.com>
wrote:

> Hi Claire,
>
> I think it would be better to continue the discussion in a related jira
> or even a PR.
>
> Cheers,
> Gabor
>
> Claire McGinty <claire.d.mcgi...@gmail.com> wrote (on Tue, Mar 5, 2024
> at 14:09):
>
> > Great, makes sense, Gabor!
> >
> > Perhaps this could even be implemented via an Integer Configuration
> > value for how many pages, or page bytes, to buffer at a time, so that
> > users can balance I/O speed against memory usage. I'll try out a few
> > approaches and aim to update this thread when I have something.
> >
> > Best,
> > Claire
> >
> > On Tue, Mar 5, 2024 at 2:55 AM Gábor Szádovszky <ga...@apache.org>
> > wrote:
> >
> > > Hi Claire,
> > >
> > > I think you read it correctly. Your proposal sounds good to me, but
> > > you would need to make it a separate way of reading instead of
> > > rewriting the current behavior. The current implementation figures
> > > out the consecutive parts of the file (multiple pages or even
> > > column chunks written after each other) and reads them in one
> > > attempt. This way the I/O is faster. Meanwhile, your concerns are
> > > also completely valid, so reading the pages lazily as they are
> > > needed saves memory. It should be up to the API client to choose
> > > between the two solutions.
> > > Since we already have interfaces that we can hide our logic behind
> > > (PageReadStore/PageReader), probably the best way would be to
> > > introduce an additional configuration that allows lazy reading
> > > behind the scenes.
> > >
> > > Cheers,
> > > Gabor
> > >
> > > Claire McGinty <claire.d.mcgi...@gmail.com> wrote (on Mon, Mar 4,
> > > 2024 at 21:04):
> > >
> > > > Hi all,
> > > >
> > > > I had a question about memory usage in ParquetFileReader,
> > > > particularly in #readNextRowGroup
> > > > <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L937>
> > > > and #readNextFilteredRowGroup
> > > > <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1076>.
> > > > From what I can tell, these methods enumerate all column chunks
> > > > in the row group and then, for each chunk, fully read all of its
> > > > pages.
> > > >
> > > > I've been encountering memory issues when performing heavy reads
> > > > of Parquet data, particularly in use cases that require
> > > > colocating multiple Parquet files on a single worker. In cases
> > > > like these, a single worker may be reading dozens or hundreds of
> > > > Parquet files, and trying to materialize the row groups causes
> > > > OOMs, even with a tweaked row group size.
> > > >
> > > > I'm wondering if there's any way to avoid materializing the
> > > > entire row group at once and instead materialize pages on an
> > > > as-needed basis (along with the dictionary encoding, etc., when
> > > > we start on a new chunk).
> > > > Looking through the ParquetFileReader code, a possible solution
> > > > could be to re-implement pagesInChunk
> > > > <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1552>
> > > > as an Iterator<DataPage> rather than a List<DataPage>, and modify
> > > > ColumnChunkPageReader to support a lazy Collection of data pages?
> > > >
> > > > Let me know what you think! It's possible that I'm
> > > > misunderstanding how readNextRowGroup works -- Parquet internals
> > > > are a steep learning curve :)
> > > >
> > > > Best,
> > > > Claire
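
To make the proposal in the original message concrete, here is a minimal
sketch of a lazily reading page reader, assuming pagesInChunk were reworked
to return an Iterator<DataPage>. The class LazyColumnChunkPageReader and its
constructor are hypothetical names invented here; PageReader, DataPage, and
DictionaryPage are the existing parquet-column types:

import java.util.Iterator;

import org.apache.parquet.column.page.DataPage;
import org.apache.parquet.column.page.DictionaryPage;
import org.apache.parquet.column.page.PageReader;

// Hypothetical sketch: serves pages from a lazy Iterator<DataPage>
// instead of a fully materialized List<DataPage>, so at most one
// decompressed data page per column chunk is held at a time.
class LazyColumnChunkPageReader implements PageReader {
  private final DictionaryPage dictionaryPage; // still read eagerly, once per chunk
  private final Iterator<DataPage> pages;      // reads/decompresses on demand
  private final long totalValueCount;

  LazyColumnChunkPageReader(
      DictionaryPage dictionaryPage, Iterator<DataPage> pages, long totalValueCount) {
    this.dictionaryPage = dictionaryPage;
    this.pages = pages;
    this.totalValueCount = totalValueCount;
  }

  @Override
  public DictionaryPage readDictionaryPage() {
    return dictionaryPage;
  }

  @Override
  public long getTotalValueCount() {
    return totalValueCount;
  }

  @Override
  public DataPage readPage() {
    // Each call materializes exactly one page; nothing else is buffered.
    return pages.hasNext() ? pages.next() : null;
  }
}

Since consumers already pull pages one at a time through
PageReader#readPage, a change along these lines would mostly be contained to
where the page bytes are materialized.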
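
And a sketch of the configuration knob discussed upthread, defaulting to the
current eager behavior so existing API clients are unaffected. Both property
names below are invented for illustration and do not exist in parquet-mr:

import org.apache.hadoop.conf.Configuration;

public final class LazyPageReadOptions {
  // Hypothetical property names, for illustration only.
  static final String LAZY_PAGE_READ_ENABLED = "parquet.read.pages.lazy";
  static final String PAGE_BUFFER_BYTES = "parquet.read.pages.buffer.bytes";

  final boolean lazyPageReadEnabled;
  final long pageBufferBytes;

  LazyPageReadOptions(Configuration conf) {
    // Defaults preserve today's eager, consecutive-read behavior.
    this.lazyPageReadEnabled = conf.getBoolean(LAZY_PAGE_READ_ENABLED, false);
    // Upper bound on page bytes to buffer at a time when lazy reading
    // is enabled (the 8 MiB default is arbitrary).
    this.pageBufferBytes = conf.getLong(PAGE_BUFFER_BYTES, 8L * 1024 * 1024);
  }
}

ParquetFileReader#readNextRowGroup could then consult options like these to
decide whether to build its PageReadStore eagerly, as it does today, or
behind a lazy PageReader such as the one sketched above.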