Thank you all for your quick responses!

BR,
    Zoltan


On Thu, Jul 12, 2018 at 8:04 PM Zoltan Ivanfi <z...@cloudera.com.invalid>
wrote:

> Hi Zoltan,
>
> If I remember correctly, this is what we had in mind in this question:
>
> Although page boundaries for v1 pages do not have to be record boundaries,
> nothing prevents us from implementing a writer that does align pages to
> record boundaries. (Of course, on the read path, we still have to be able to
> handle pages that are not aligned with respect to records.)
>
> When we add indexes to the picture, aligning pages to record boundaries
> becomes desirable. And since indexes are a new feature, we can simply
> implement them in a way so that we always do this alignment when writing
> pages with indexes.
>
> Br,
>
> Zoltan
>
> On Thu, Jul 12, 2018 at 7:12 PM Zoltan Borok-Nagy
> <borokna...@cloudera.com.invalid> wrote:
>
> > Hi everyone,
> >
> > Currently I am working on the implementation of the Parquet page index for
> > Impala.
> > (design doc is here if you are interested:
> > https://docs.google.com/document/d/1D-el8njq_I-JKd3NDcW1mRXID_n0dBDKIkjWxwULVus/edit?usp=sharing
> > )
> >
> > During our discussions it came up that DataPageHeaderV2 states that page
> > boundaries are also record boundaries:
> >
> > https://github.com/apache/parquet-format/blob/54e6133e887a6ea90501ddd72fff5312b7038a7c/src/main/thrift/parquet.thrift#L532
> >
> > DataPageHeader(V1) doesn't have this statement, which means that in theory
> > it allows records to span multiple pages. Is that really the case, or is it
> > something that is missing from the specification?
> >
> > I ask this because filtering pages based on the page index is much simpler
> > if page boundaries are record boundaries as well.
> >
> > Thanks,
> >     Zoltan
> >
>
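
For illustration, here is a minimal sketch of the writer-side alignment
discussed in this thread. It relies on the fact that a repetition level of 0
marks the first value of a new record. The function names, the
max_values_per_page threshold, and the in-memory list of repetition levels are
assumptions made for the example; this is not how Impala or parquet-mr
implement it.

```python
from typing import List, Tuple


def split_pages_on_record_boundaries(
    rep_levels: List[int], max_values_per_page: int
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) value ranges, one per page.

    Hypothetical helper: a page is closed only at a value whose repetition
    level is 0, so every page starts at the beginning of a record.
    """
    pages = []
    page_start = 0
    for i, level in enumerate(rep_levels):
        page_is_full = i - page_start >= max_values_per_page
        # Only cut the page where a new record begins (repetition level 0).
        if level == 0 and i > page_start and page_is_full:
            pages.append((page_start, i))
            page_start = i
    if page_start < len(rep_levels):
        pages.append((page_start, len(rep_levels)))
    return pages


def first_row_indexes(
    rep_levels: List[int], pages: List[Tuple[int, int]]
) -> List[int]:
    """Return the index of the first record in each page.

    This is only well defined because every page starts on a record boundary.
    """
    indexes = []
    rows_seen = 0
    for start, end in pages:
        indexes.append(rows_seen)
        # Each repetition level of 0 inside the page starts one record.
        rows_seen += sum(1 for level in rep_levels[start:end] if level == 0)
    return indexes


if __name__ == "__main__":
    # Five records in a repeated column: 3, 1, 2, 4 and 1 values long.
    levels = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
    pages = split_pages_on_record_boundaries(levels, max_values_per_page=3)
    print(pages)
    print(first_row_indexes(levels, pages))
```

With pages cut this way, a reader that skips a page based on the index skips
whole records only, so the remaining pages of every column can be matched up by
row number. Pages may exceed the threshold, because a record is never split.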
