Hi everybody, I'm trying to understand the I/O mechanism and memory footprint of the parquet-mr library. In particular, I wish to understand what happens when the ParquetFileReader reads the next row group. For simplicity, I'm interested first in the case where no filtering is required and we need to read all records in the file and print them out.
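For concreteness, the scenario I have in mind is the standard low-level read loop over row groups (a sketch, assuming the Hadoop input path and the example-module converter classes; GroupRecordConverter and friends come from parquet-mr's example module, not from my own code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;

public class PrintAllRecords {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]); // path to a Parquet file
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
      MessageType schema = reader.getFooter().getFileMetaData().getSchema();
      MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
      PageReadStore rowGroup;
      // readNextRowGroup() hands back one row group at a time, or null at EOF;
      // my question is about how much I/O happens inside this call
      while ((rowGroup = reader.readNextRowGroup()) != null) {
        RecordReader<Group> recordReader =
            columnIO.getRecordReader(rowGroup, new GroupRecordConverter(schema));
        for (long i = 0, n = rowGroup.getRowCount(); i < n; i++) {
          System.out.println(recordReader.read());
        }
      }
    }
  }
}
```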
Does the ParquetFileReader load the entire row group into its internal memory each time? Can it be configured to read the file lazily and in a fine-grained way: at each step, read only the current page of each column instead of reading all pages of the row group's column chunks in advance? That is, read the first page of each column, process it and produce the records inside it, and only then read the second page, and so on.

As I understand it, the readNextFilteredRowGroup method first figures out all the metadata and creates a list of ConsecutivePartList objects for all the chunks we are about to read. After that, it calls readAll for each ConsecutivePartList. In the case where I'm reading all columns of the Parquet file, these ConsecutivePartLists would together cover all pages of all columns in the row group, right? Inside the readAll method, ByteBuffers are allocated and readFully is called on them. From what I understand, parquet-mr uses HeapByteBuffer and DirectByteBuffer as its ByteBuffers, and neither of them supports lazy evaluation: when you read data into them, the data is actually read right away.

So, is it possible to configure the ParquetFileReader to read the pages of a row group lazily, reading at each step only the relevant page of each column?

Regards,
Tomer Solomon
