Hi everybody,

I'm trying to understand the IO mechanism and memory foot-print of the
parquet-mr library.
In particular, I wish to understand what happens when the ParquetFileReader
reads the next row-group. For simplicity, I'm interested first in the case
where no filtering is required and we need to read all records in the file
and print them out.
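For reference, the read loop I'm experimenting with is roughly the standard low-level pattern below (a sketch using the parquet-mr example API; the input path comes from the command line, and it assumes the hadoop and parquet-mr jars are on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;

public class DumpParquet {
    public static void main(String[] args) throws Exception {
        // args[0] is the path to a Parquet file
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
            MessageType schema = reader.getFooter().getFileMetaData().getSchema();
            PageReadStore rowGroup;
            // readNextRowGroup() hands back one row-group's pages at a time
            while ((rowGroup = reader.readNextRowGroup()) != null) {
                MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
                RecordReader<Group> records =
                        columnIO.getRecordReader(rowGroup, new GroupRecordConverter(schema));
                for (long i = 0; i < rowGroup.getRowCount(); i++) {
                    System.out.println(records.read()); // print each record
                }
            }
        }
    }
}
```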

Does the ParquetFileReader load the entire row-group into its internal
memory in advance each time? Can it be configured to read the file lazily
and at a finer granularity: at each step, read only the current page of each
column, instead of reading in advance all the pages of every column-chunk in
the row-group? That is, read the first page of each column, process it and
produce the records it contains, and only then read the second page, and so on.

As I understand it, the readNextFilteredRowGroup method first figures out
all the metadata and creates an array of ConsecutivePartList objects for all
the chunks we are about to read. After that, it calls readAll for each
consecutive chunk. In the case where I'm reading all columns in the Parquet
file, this ConsecutivePartList would contain all the pages of all the
columns in the row group, right?

Inside the readAll method, ByteBuffers are allocated and readFully is
called on them. From what I understand, parquet-mr uses HeapByteBuffer and
DirectByteBuffer as its ByteBuffer implementations. Neither of them supports
lazy evaluation, so when data is read into them, the I/O happens right away.
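To illustrate what I mean by "no lazy evaluation": both heap and direct ByteBuffers are plain fixed-size containers, so a readFully-style loop pulls all the bytes into memory before it returns. Here is a small self-contained JDK demo (the readFully helper here is my own mimic of that pattern, not parquet-mr's code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class EagerReadDemo {
    // Mimics a readFully-style call: loop until the buffer is full.
    // The I/O is performed immediately; there is no lazy view of the file.
    static void readFully(FileChannel ch, ByteBuffer buf) throws IOException {
        while (buf.hasRemaining()) {
            if (ch.read(buf) < 0) {
                throw new IOException("EOF before buffer was full");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".bin");
        Files.write(tmp, new byte[1024]); // stand-in for a column chunk
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            ByteBuffer heap = ByteBuffer.allocate(512);         // HeapByteBuffer
            readFully(ch, heap);
            ByteBuffer direct = ByteBuffer.allocateDirect(512); // DirectByteBuffer
            readFully(ch, direct);
            // Both buffers are fully populated the moment readFully returns.
            System.out.println(heap.position() + " " + direct.position());
        } finally {
            Files.delete(tmp);
        }
    }
}
```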

So, is it possible to configure the ParquetFileReader to read the pages in
a row-group lazily, reading at each step only the relevant page of each
column?

Regards,
Tomer Solomon
