Sounds good! I created PARQUET-2443 <https://issues.apache.org/jira/browse/PARQUET-2443>.
Best,
Claire

On Tue, Mar 5, 2024 at 8:43 AM Gábor Szádovszky <gabor.szadovs...@gmail.com>
wrote:

> Hi Claire,
>
> I think it would be better to continue the discussion in a related jira
> or even a PR.
>
> Cheers,
> Gabor
>
> Claire McGinty <claire.d.mcgi...@gmail.com> wrote (on Tue, Mar 5, 2024
> at 14:09):
>
> > Great, makes sense, Gabor!
> >
> > Perhaps this could even be implemented via an Integer Configuration
> > value for how many pages, or page bytes, to buffer at a time, so that
> > users can balance I/O speed against memory usage. I'll try out a few
> > approaches and aim to update this thread when I have something.
> >
> > Best,
> > Claire
> >
> > On Tue, Mar 5, 2024 at 2:55 AM Gábor Szádovszky <ga...@apache.org>
> > wrote:
> >
> > > Hi Claire,
> > >
> > > I think you read it correctly. Your proposal sounds good to me, but
> > > you would need to make it a separate way of reading instead of
> > > rewriting the current behavior. The current implementation figures
> > > out the consecutive parts of the file (multiple pages or even
> > > column chunks written after each other) and reads them in one
> > > attempt. This way the I/O is faster. Meanwhile, your concerns are
> > > also completely valid, so reading the pages lazily as they are
> > > needed saves memory. It should be up to the API client to choose
> > > between the two solutions.
> > > Since we already have interfaces that we can hide our logic behind
> > > (PageReadStore/PageReader), probably the best way would be to
> > > introduce an additional configuration that allows lazy reading
> > > behind the scenes.
> > >
> > > Cheers,
> > > Gabor
> > >
> > > Claire McGinty <claire.d.mcgi...@gmail.com> wrote (on Mon, Mar 4,
> > > 2024 at 21:04):
> > >
> > > > Hi all,
> > > >
> > > > I had a question about memory usage in ParquetFileReader,
> > > > particularly in #readNextRowGroup
> > > > <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L937>
> > > > and #readNextFilteredRowGroup
> > > > <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1076>.
> > > > From what I can tell, these methods enumerate all column chunks
> > > > in the row group and then, for each chunk, fully read all of its
> > > > pages.
> > > >
> > > > I've been encountering memory issues when performing heavy reads
> > > > of Parquet data, particularly in use cases that require
> > > > colocating multiple Parquet files on a single worker. In cases
> > > > like these, a single worker may be reading dozens or hundreds of
> > > > Parquet files, and trying to materialize the row groups causes
> > > > OOMs, even with a tweaked row group size.
> > > >
> > > > I'm wondering if there's any way to avoid materializing the
> > > > entire row group at once and instead materialize pages on an
> > > > as-needed basis (along with the dictionary encoding, etc., when
> > > > we start on a new chunk).
> > > > Looking through the ParquetFileReader code, a possible solution
> > > > could be to re-implement pagesInChunk
> > > > <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1552>
> > > > as an Iterator<DataPage> rather than a List<DataPage>, and modify
> > > > ColumnChunkPageReader to support a lazy Collection of data pages?
> > > >
> > > > Let me know what you think! It's possible that I'm
> > > > misunderstanding how readNextRowGroup works -- Parquet internals
> > > > are a steep learning curve :)
> > > >
> > > > Best,
> > > > Claire
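
To make the proposal in the original message concrete, here is a minimal
sketch of a lazily reading page reader, assuming pagesInChunk were reworked
to return an Iterator<DataPage>. The class LazyColumnChunkPageReader and its
constructor are hypothetical names invented here; PageReader, DataPage, and
DictionaryPage are the existing parquet-column types:

import java.util.Iterator;

import org.apache.parquet.column.page.DataPage;
import org.apache.parquet.column.page.DictionaryPage;
import org.apache.parquet.column.page.PageReader;

// Hypothetical sketch: serves pages from a lazy Iterator<DataPage>
// instead of a fully materialized List<DataPage>, so at most one
// decompressed data page per column chunk is held at a time.
class LazyColumnChunkPageReader implements PageReader {
  private final DictionaryPage dictionaryPage; // still read eagerly, once per chunk
  private final Iterator<DataPage> pages;      // reads/decompresses on demand
  private final long totalValueCount;

  LazyColumnChunkPageReader(
      DictionaryPage dictionaryPage, Iterator<DataPage> pages, long totalValueCount) {
    this.dictionaryPage = dictionaryPage;
    this.pages = pages;
    this.totalValueCount = totalValueCount;
  }

  @Override
  public DictionaryPage readDictionaryPage() {
    return dictionaryPage;
  }

  @Override
  public long getTotalValueCount() {
    return totalValueCount;
  }

  @Override
  public DataPage readPage() {
    // Each call materializes exactly one page; nothing else is buffered.
    return pages.hasNext() ? pages.next() : null;
  }
}

Since consumers already pull pages one at a time through
PageReader#readPage, a change along these lines would mostly be contained to
where the page bytes are materialized.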
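
And a sketch of the configuration knob discussed upthread, defaulting to the
current eager behavior so existing API clients are unaffected. Both property
names below are invented for illustration and do not exist in parquet-mr:

import org.apache.hadoop.conf.Configuration;

public final class LazyPageReadOptions {
  // Hypothetical property names, for illustration only.
  static final String LAZY_PAGE_READ_ENABLED = "parquet.read.pages.lazy";
  static final String PAGE_BUFFER_BYTES = "parquet.read.pages.buffer.bytes";

  final boolean lazyPageReadEnabled;
  final long pageBufferBytes;

  LazyPageReadOptions(Configuration conf) {
    // Defaults preserve today's eager, consecutive-read behavior.
    this.lazyPageReadEnabled = conf.getBoolean(LAZY_PAGE_READ_ENABLED, false);
    // Upper bound on page bytes to buffer at a time when lazy reading
    // is enabled (the 8 MiB default is arbitrary).
    this.pageBufferBytes = conf.getLong(PAGE_BUFFER_BYTES, 8L * 1024 * 1024);
  }
}

ParquetFileReader#readNextRowGroup could then consult options like these to
decide whether to build its PageReadStore eagerly, as it does today, or
behind a lazy PageReader such as the one sketched above.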