Hi Zoltan,

If I remember correctly, this is what we had in mind regarding this question:

Although page boundaries for v1 pages do not have to be record boundaries,
nothing prevents us from implementing a writer that does align pages to
record boundaries. (Of course, on the read path, we have to be able to
handle pages that are not aligned with respect to records.)

When we add indexes to the picture, aligning pages to record boundaries
becomes desirable. And since indexes are a new feature, we can simply
implement them in a way so that we always do this alignment when writing
pages with indexes.
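
To illustrate the idea, here is a minimal sketch (not the actual Impala or
parquet-mr writer) of choosing page boundaries so that they always fall on
record boundaries. It relies on the fact that a repetition level of 0 marks
the first value of a new record. The function name and the
target_values_per_page parameter are hypothetical, made up for this example.

```python
from typing import List, Tuple


def split_pages_on_record_boundaries(
    rep_levels: List[int], target_values_per_page: int
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) value ranges, one per page.

    Hypothetical helper: a page is only closed right before a value whose
    repetition level is 0, i.e. right before the start of a new record.
    """
    pages = []
    start = 0
    for i, level in enumerate(rep_levels):
        # Cut only at a record start, and only once the current page
        # already holds at least the target number of values.
        if level == 0 and i > start and i - start >= target_values_per_page:
            pages.append((start, i))
            start = i
    # Flush the values that remain after the last cut.
    if start < len(rep_levels):
        pages.append((start, len(rep_levels)))
    return pages


# Four records with 3, 1, 4 and 1 values respectively.
example_levels = [0, 1, 1, 0, 0, 1, 1, 1, 0]
example_pages = split_pages_on_record_boundaries(example_levels, 2)
```

With this approach a page may exceed the target size when a record is long,
which is the price of never splitting a record across pages.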

Br,

Zoltan

On Thu, Jul 12, 2018 at 7:12 PM Zoltan Borok-Nagy
<[email protected]> wrote:

> Hi everyone,
>
> Currently I am working on the implementation of the Parquet page index for
> Impala.
> (design doc is here if you are interested:
>
> https://docs.google.com/document/d/1D-el8njq_I-JKd3NDcW1mRXID_n0dBDKIkjWxwULVus/edit?usp=sharing
> )
>
> During our discussions it came up that DataPageHeaderV2 states that page
> boundaries are also record boundaries:
>
> https://github.com/apache/parquet-format/blob/54e6133e887a6ea90501ddd72fff5312b7038a7c/src/main/thrift/parquet.thrift#L532
>
> DataPageHeader (V1) doesn't have this statement, which means that in theory
> it allows records to span multiple pages. Is this really the case, or
> is it something that is missing from the specification?
>
> I ask this because filtering pages based on the page index is much
> simpler if page boundaries are record boundaries as well.
>
> Thanks,
>     Zoltan
>
