Re: Parquet-mr - ParquetFileReader IO and memory foot-print

Gabor Szadovszky Mon, 04 Mar 2019 00:27:48 -0800

Hi Tomer,

parquet-mr does not currently support lazy reading. The reason is
performance.
Within a row group, the pages of one column are written one after another
(together they form a column chunk), followed by the pages of the next
column, and so on. So if you wanted to keep only one page per column in
memory, the reader would need a seek in the file for nearly every page it
reads. It is much faster to read the consecutive parts in a single read,
which results in far fewer IO operations.
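
As a rough illustration of the seek argument (a toy model, not parquet-mr
code), the two read orders can be compared by counting how often a read does
not start where the previous one ended. The sketch assumes fixed-size pages
and column chunks stored back to back; real pages vary in size, and the class
and method names are hypothetical.

```java
public class SeekCountSketch {

    /** Offset of page p of column c, assuming back-to-back chunks and fixed-size pages. */
    public static long pageOffset(int c, int p, int pagesPerColumn, long pageBytes) {
        return ((long) c * pagesPerColumn + p) * pageBytes;
    }

    /** Read order when each column chunk is read in full, one column after another. */
    public static long[] chunkOrder(int columns, int pagesPerColumn, long pageBytes) {
        long[] offsets = new long[columns * pagesPerColumn];
        int i = 0;
        for (int c = 0; c < columns; c++) {
            for (int p = 0; p < pagesPerColumn; p++) {
                offsets[i++] = pageOffset(c, p, pagesPerColumn, pageBytes);
            }
        }
        return offsets;
    }

    /** Read order when only one page per column is fetched at each step. */
    public static long[] lazyOrder(int columns, int pagesPerColumn, long pageBytes) {
        long[] offsets = new long[columns * pagesPerColumn];
        int i = 0;
        for (int p = 0; p < pagesPerColumn; p++) {
            for (int c = 0; c < columns; c++) {
                offsets[i++] = pageOffset(c, p, pagesPerColumn, pageBytes);
            }
        }
        return offsets;
    }

    /** A seek is counted whenever a read does not start where the previous read ended. */
    public static int countSeeks(long[] offsets, long pageBytes) {
        int seeks = 0;
        long position = -1; // unknown start position, so the first read always seeks
        for (long offset : offsets) {
            if (offset != position) {
                seeks++;
            }
            position = offset + pageBytes;
        }
        return seeks;
    }

    public static void main(String[] args) {
        int columns = 3;          // hypothetical row group: 3 columns
        int pagesPerColumn = 4;   // 4 pages in each column chunk
        long pageBytes = 1024;    // assumed fixed page size
        System.out.println("chunk-at-a-time seeks: "
                + countSeeks(chunkOrder(columns, pagesPerColumn, pageBytes), pageBytes));
        System.out.println("page-at-a-time seeks:  "
                + countSeeks(lazyOrder(columns, pagesPerColumn, pageBytes), pageBytes));
    }
}
```

In this model the chunk-at-a-time order needs a single seek, while the
page-at-a-time order needs one seek per page (12 for 3 columns of 4 pages).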


That said, I understand that this requires much more memory than the lazy
reading you've suggested. A switchable lazy-reading mode might be a good
improvement for parquet-mr, and it would be interesting to see some
benchmarks comparing the two approaches.
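
To make the memory side of the trade-off concrete, here is a minimal sketch
(again a toy model, not parquet-mr code) that compares the bytes buffered
when all column chunks of a row group are read up front with an upper bound
for holding one page per column. The page sizes and all names are
hypothetical.

```java
public class BufferFootprintSketch {

    /** Bytes buffered when every page of every column chunk is read in advance. */
    public static long eagerBytes(long[][] pageSizes) {
        long total = 0;
        for (long[] column : pageSizes) {
            for (long size : column) {
                total += size;
            }
        }
        return total;
    }

    /** Upper bound when one page per column is held: the sum of each column's largest page. */
    public static long lazyPeakBytes(long[][] pageSizes) {
        long total = 0;
        for (long[] column : pageSizes) {
            long largest = 0;
            for (long size : column) {
                largest = Math.max(largest, size);
            }
            total += largest;
        }
        return total;
    }

    public static void main(String[] args) {
        // pageSizes[c][p] = size in bytes of page p of column c (made-up values)
        long[][] pageSizes = {
            {100, 200},
            {300, 50},
        };
        System.out.println("eager buffer bytes:     " + eagerBytes(pageSizes));
        System.out.println("lazy peak buffer bytes: " + lazyPeakBytes(pageSizes));
    }
}
```

The gap between the two numbers grows with the number of pages per column
chunk, which is what a benchmark of the two approaches would have to weigh
against the extra seeks.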

Regards,
Gabor

On Fri, Mar 1, 2019 at 8:11 PM Tomer Solomon <[email protected]>
wrote:

> Hi everybody,
>
> I'm trying to understand the IO mechanism and memory foot-print of the
> parquet-mr library.
> In particular, I wish to understand what happens when the ParquetFileReader
> reads the next row-group. For simplicity, I'd first like to understand the
> case where no filtering is required and we need to read all records in the
> file and print them out.
>
> Does the ParquetFileReader load the entire row-group into its internal
> memory in advance each time? Can it be configured to read the file lazily
> and in a fine-grained way: at each step, read only the current page of each
> column, instead of reading in advance all pages of the column-chunks in the
> row-group? I mean, read the first page of each column, process it and
> produce the records inside it, and only then read the second one, etc.
>
> As I understand, the NextFilteredRowGroup method first figures out all
> metadata, and creates an array of ConsecutivePartList for all the chunks we
> are about to read. After that, it calls readAll for each consecutiveChunk.
> In the case where I'm reading all columns in the Parquet file, this
> ConsecutivePartList would contain all pages in all columns in the row
> group, right?
>
> Inside the readAll method, ByteBuffers are allocated, and we call
> readFully on them. Now, from what I understand, parquet-mr uses
> HeapByteBuffer and DirectByteBuffer as its ByteBuffer implementations. In
> particular, neither of them supports lazy evaluation. So when you read data
> into them, the data is actually read right away.
>
> So, is it possible to configure the ParquetFileReader to read pages in the
> row-group lazily, and at each step read only the relevant pages for each
> column?
>
> Regards,
> Tomer Solomon
>
