Zoltan,

If you write the new page index structure, you're required to split pages
on record boundaries, even when using v1 pages. By placing that requirement
on the write side, we ensure that this feature is compatible with both v1
and v2 pages.
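
To make the write-side rule concrete, here is a minimal sketch (not code
from parquet-mr or Impala). It relies on the fact that a value with
repetition level 0 starts a new record, so a writer may only cut a page at
such a value. The function name and the max_values_per_page soft limit are
hypothetical, chosen only for illustration.

```python
from typing import List


def split_on_record_boundaries(rep_levels: List[int],
                               max_values_per_page: int) -> List[int]:
    """Return the start offset of each page within a column chunk.

    Hypothetical helper: every page starts at a value whose repetition
    level is 0, i.e. at the first value of a record.
    """
    if not rep_levels:
        return []
    if rep_levels[0] != 0:
        raise ValueError("column chunk must start at a record boundary")

    page_starts = [0]
    for i in range(1, len(rep_levels)):
        values_in_page = i - page_starts[-1]
        # Cut only when the soft limit is reached AND value i begins a new
        # record; otherwise the page grows until the current record ends.
        if values_in_page >= max_values_per_page and rep_levels[i] == 0:
            page_starts.append(i)
    return page_starts


# Nested column: records start at offsets 0, 3, 4 and 8.
print(split_on_record_boundaries([0, 1, 1, 0, 0, 1, 1, 1, 0, 1], 3))
```

With a limit of 3 values, the second page grows to 5 values because the
record starting at offset 4 cannot be split across pages. For a flat
column every repetition level is 0, so every value is a valid cut point.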

rb

On Thu, Jul 12, 2018 at 10:12 AM Zoltan Borok-Nagy
<[email protected]> wrote:

> Hi everyone,
>
> Currently I am working on the implementation of the Parquet page index for
> Impala.
> (design doc is here if you are interested:
>
> https://docs.google.com/document/d/1D-el8njq_I-JKd3NDcW1mRXID_n0dBDKIkjWxwULVus/edit?usp=sharing
> )
>
> During our discussions it came up that DataPageHeaderV2 states that page
> boundaries are also record boundaries:
>
> https://github.com/apache/parquet-format/blob/54e6133e887a6ea90501ddd72fff5312b7038a7c/src/main/thrift/parquet.thrift#L532
>
> DataPageHeader(V1) doesn't have this statement, which means that in theory
> it allows records to span multiple pages. Is that really the case, or
> is it something that is missing from the specification?
>
> I ask this because filtering pages based on the page index is much
> simpler if page boundaries are record boundaries as well.
>
> Thanks,
>     Zoltan
>


-- 
Ryan Blue
Software Engineer
Netflix
