Thank you all for your quick responses!

BR,
Zoltan

On Thu, Jul 12, 2018 at 8:04 PM Zoltan Ivanfi <z...@cloudera.com.invalid> wrote:

> Hi Zoltan,
>
> If I remember correctly, this is what we had in mind on this question:
>
> Although page boundaries for v1 pages do not have to be record boundaries,
> nothing prevents us from implementing a writer that does align pages to
> record boundaries. (Of course, on the read path, we have to be able to
> handle pages that are not aligned with respect to records.)
>
> When we add indexes to the picture, aligning pages to record boundaries
> becomes desirable. And since indexes are a new feature, we can simply
> implement them so that we always do this alignment when writing
> pages with indexes.
>
> Br,
>
> Zoltan
>
> On Thu, Jul 12, 2018 at 7:12 PM Zoltan Borok-Nagy
> <borokna...@cloudera.com.invalid> wrote:
>
> > Hi everyone,
> >
> > Currently I am working on the implementation of the Parquet page index
> > for Impala.
> > (design doc is here if you are interested:
> > https://docs.google.com/document/d/1D-el8njq_I-JKd3NDcW1mRXID_n0dBDKIkjWxwULVus/edit?usp=sharing
> > )
> >
> > During our discussions it came up that DataPageHeaderV2 states that page
> > boundaries are also record boundaries:
> >
> > https://github.com/apache/parquet-format/blob/54e6133e887a6ea90501ddd72fff5312b7038a7c/src/main/thrift/parquet.thrift#L532
> >
> > DataPageHeader(V1) doesn't have this statement, which means that in
> > theory it allows records to span multiple pages. Is that really the
> > case, or is it something that is missing from the specification?
> >
> > I ask this because filtering pages based on the page index is much
> > simpler if page boundaries are record boundaries as well.
> >
> > Thanks,
> > Zoltan
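
To illustrate why record-aligned pages make index-based filtering simpler, here is a minimal sketch. It is not Impala or parquet-mr code: the function names and the `max_values_per_page` parameter are hypothetical, and a column chunk is modelled only as a list of repetition levels. The one fact it relies on is that a repetition level of 0 marks the first value of a new record.

```python
from typing import List


def split_pages_unaligned(rep_levels: List[int], max_values_per_page: int) -> List[List[int]]:
    """Cut a page every max_values_per_page values, ignoring record boundaries.

    This is what v1 data pages permit: a record may continue on the next page.
    """
    return [
        rep_levels[i:i + max_values_per_page]
        for i in range(0, len(rep_levels), max_values_per_page)
    ]


def split_pages_aligned(rep_levels: List[int], max_values_per_page: int) -> List[List[int]]:
    """Cut a page only where a new record starts (repetition level 0).

    A page may exceed max_values_per_page so that a record is never split.
    """
    pages: List[List[int]] = []
    current: List[int] = []
    for level in rep_levels:
        # Only close the current page when the next value begins a new record.
        if level == 0 and len(current) >= max_values_per_page:
            pages.append(current)
            current = []
        current.append(level)
    if current:
        pages.append(current)
    return pages


def first_row_indexes(pages: List[List[int]]) -> List[int]:
    """Row index of the first record in each page.

    Only meaningful when every page starts at a record boundary.
    """
    indexes: List[int] = []
    rows_seen = 0
    for page in pages:
        if page[0] != 0:
            raise ValueError("page does not start at a record boundary")
        indexes.append(rows_seen)
        rows_seen += page.count(0)  # each level 0 starts one record
    return indexes


# Four records with 3, 1, 2 and 1 values respectively.
levels = [0, 1, 1, 0, 0, 1, 0]
unaligned = split_pages_unaligned(levels, 2)
aligned = split_pages_aligned(levels, 2)
```

With aligned pages, each page maps to a contiguous range of whole rows, so a reader can skip a page in one column and skip exactly the matching rows in the others. With the unaligned split, the second page begins in the middle of the first record, and `first_row_indexes` rejects it.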