Great, that makes sense, Gabor!

Perhaps this could even be implemented via an integer configuration value
for how many pages, or page bytes, to buffer at a time, so that users can
balance I/O speed against memory usage. I'll try out a few approaches and
aim to update this thread when I have something.
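
As a rough sketch of the kind of knob I'm imagining (the property name,
class, and default below are placeholders, not an existing parquet-mr
option):

    import org.apache.hadoop.conf.Configuration;

    // Placeholder names only -- sketching how a page-buffering knob could look.
    public class LazyReadConfigSketch {
      // Hypothetical property: max pages to buffer per column chunk;
      // 0 keeps today's eager, consecutive-read behavior.
      public static final String PAGE_READ_BUFFER_COUNT =
          "parquet.page.read.buffer.count";

      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt(PAGE_READ_BUFFER_COUNT, 4); // buffer at most 4 pages at a time

        // Inside the reader, the value would decide eager vs. lazy page loading.
        int bufferCount = conf.getInt(PAGE_READ_BUFFER_COUNT, 0);
        boolean lazyPageReads = bufferCount > 0;
        System.out.println("lazy page reads enabled: " + lazyPageReads);
      }
    }

That way existing callers would keep the consecutive-read fast path by
default, and memory-sensitive jobs could opt in.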

Best,
Claire



On Tue, Mar 5, 2024 at 2:55 AM Gábor Szádovszky <ga...@apache.org> wrote:

> Hi Claire,
>
> I think you read it correctly. Your proposal sounds good to me, but you would
> need to make it a separate way of reading instead of rewriting the current
> behavior. The current implementation figures out the consecutive parts of
> the file (multiple pages or even column chunks written after each other)
> and reads them in one attempt; this way the I/O is faster. Meanwhile, your
> concerns are also completely valid, so reading the pages lazily as they are
> needed saves memory. It should be up to the API client to choose between
> the two approaches.
> Since we already have interfaces that we can hide our logic behind
> (PageReadStore/PageReader), probably the best way would be to introduce an
> additional configuration that enables lazy reading behind the scenes.
>
> Cheers,
> Gabor
>
> Claire McGinty <claire.d.mcgi...@gmail.com> wrote (on Mon, Mar 4, 2024 at
> 21:04):
>
> > Hi all,
> >
> > I had a question about memory usage in ParquetFileReader, particularly in
> > #readNextRowGroup
> > <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L937>
> > / #readNextFilteredRowGroup
> > <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1076>.
> > From what I can tell, these methods will enumerate all column chunks in
> > the row group, then for each chunk, fully read all pages in the chunk.
> >
> > I've been encountering memory issues performing heavy reads of Parquet
> > data, particularly in use cases that require colocating multiple Parquet
> > files on a single worker. In cases like these, a single worker may be
> > reading dozens or hundreds of Parquet files, and trying to materialize
> > row groups is causing OOMs, even with a tuned row group size.
> >
> > I'm wondering if there's any way to avoid materializing the entire row
> > group at once, and instead materialize pages on an as-needed basis (along
> > with dictionary encoding, etc., when we start on a new chunk). Looking
> > through the ParquetFileReader code, a possible solution could be to
> > re-implement pagesInChunk
> > <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1552>
> > as an Iterator<DataPage> rather than a List<DataPage>, and modify
> > ColumnChunkPageReader to support a lazy collection of data pages?
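> >
> > Roughly what I have in mind (names are placeholders -- readOnePage() here
> > stands in for the per-page read/decompress step; it's not a real
> > parquet-mr method):
> >
> >   import java.util.Iterator;
> >   import java.util.NoSuchElementException;
> >   import org.apache.parquet.column.page.DataPage;
> >
> >   // Sketch: hand out pages one at a time instead of materializing the
> >   // whole chunk as a List<DataPage> up front.
> >   abstract class LazyPageIterator implements Iterator<DataPage> {
> >     private int remaining; // page count taken from the chunk metadata
> >
> >     LazyPageIterator(int pageCount) {
> >       this.remaining = pageCount;
> >     }
> >
> >     // Placeholder for reading and decompressing a single page on demand.
> >     protected abstract DataPage readOnePage();
> >
> >     @Override
> >     public boolean hasNext() {
> >       return remaining > 0;
> >     }
> >
> >     @Override
> >     public DataPage next() {
> >       if (!hasNext()) {
> >         throw new NoSuchElementException();
> >       }
> >       remaining--;
> >       return readOnePage();
> >     }
> >   }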
> >
> > Let me know what you think! It's possible that I'm misunderstanding how
> > readNextRowGroup works -- Parquet internals have a steep learning curve :)
> >
> > Best,
> > Claire
> >
>
