Hi Zoltan,

If I remember correctly, this is what we had in mind regarding this question:
Although page boundaries for v1 pages do not have to be record boundaries, nothing prevents us from implementing a writer that does align pages to record boundaries. (Of course, on the read path, we have to be able to handle pages that are not aligned with respect to records.)

When we add indexes to the picture, aligning pages to record boundaries becomes desirable. And since indexes are a new feature, we can simply implement them so that we always do this alignment when writing pages with indexes.

Br,
Zoltan

On Thu, Jul 12, 2018 at 7:12 PM Zoltan Borok-Nagy <[email protected]> wrote:

> Hi everyone,
>
> Currently I am working on the implementation of the Parquet page index for
> Impala.
> (design doc is here if you are interested:
> https://docs.google.com/document/d/1D-el8njq_I-JKd3NDcW1mRXID_n0dBDKIkjWxwULVus/edit?usp=sharing
> )
>
> During our discussions it came up that DataPageHeaderV2 states that page
> boundaries are also record boundaries:
>
> https://github.com/apache/parquet-format/blob/54e6133e887a6ea90501ddd72fff5312b7038a7c/src/main/thrift/parquet.thrift#L532
>
> DataPageHeader(V1) doesn't have this statement, which means that in theory
> it allows records to span through multiple pages. Is it really the case, or
> is it something that is missing from the specification?
>
> I ask this because filtering pages based on the page index is much more
> simple if page boundaries are record boundaries as well.
>
> Thanks,
> Zoltan
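
P.S. To illustrate the idea of a writer that aligns pages to record boundaries, here is a rough sketch in Python. It is not taken from any existing Parquet writer. The function name `split_into_aligned_pages`, the `(repetition_level, value)` tuple layout, and the `target_values_per_page` parameter are all made up for illustration. The only fact it relies on is that a repetition level of 0 marks the first value of a new record.

```python
from typing import Iterable, List, Tuple

# Hypothetical in-memory representation of a column chunk: each entry is
# (repetition_level, value). A repetition level of 0 starts a new record.
Entry = Tuple[int, object]


def split_into_aligned_pages(
    entries: Iterable[Entry], target_values_per_page: int
) -> List[List[Entry]]:
    """Group entries into pages, closing a page only at a record boundary."""
    pages: List[List[Entry]] = []
    current: List[Entry] = []
    for rep_level, value in entries:
        # Close the current page only if it has reached the size target
        # AND the incoming entry begins a new record (repetition level 0).
        page_is_full = len(current) >= target_values_per_page
        if current and page_is_full and rep_level == 0:
            pages.append(current)
            current = []
        current.append((rep_level, value))
    if current:
        pages.append(current)
    return pages


# Two records of three values each, with a page size target of two values.
entries = [(0, "a"), (1, "b"), (1, "c"), (0, "d"), (1, "e"), (1, "f")]
pages = split_into_aligned_pages(entries, target_values_per_page=2)
```

With this approach a page can grow past the size target when a record is long, which is the price of never splitting a record across pages. A real writer would track bytes rather than value counts, but the boundary check would look the same.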
