We ran into a similar question in the Rust Parquet implementation [1].

Raphael's conclusion was that pages must start at r-level 0 when V2 data
pages are used or when a page index is present. Among other reasons, if this
does not hold, pushdown on nested columns is not possible, because a reader
has no way of knowing where the last record of a page actually ends.
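
To make the boundary argument concrete, here is a small illustrative sketch
in plain Python. It is not taken from parquet-rs or any other implementation;
the function names and the toy writer are hypothetical, and the only
assumption is that a repetition level of 0 marks the first value of a new
record.

```python
from typing import List, Sequence


def starts_on_record_boundary(rep_levels: Sequence[int]) -> bool:
    """True if the page is empty or its first repetition level is 0."""
    return len(rep_levels) == 0 or rep_levels[0] == 0


def count_record_starts(rep_levels: Sequence[int]) -> int:
    """Count values with repetition level 0, i.e. values that begin a record."""
    return sum(1 for level in rep_levels if level == 0)


def split_into_pages(rep_levels: Sequence[int], max_values: int) -> List[List[int]]:
    """Hypothetical writer: start a new page once max_values is reached,
    but only at a position where a new record begins (repetition level 0)."""
    pages: List[List[int]] = []
    current: List[int] = []
    for level in rep_levels:
        if level == 0 and len(current) >= max_values:
            pages.append(current)
            current = []
        current.append(level)
    if current:
        pages.append(current)
    return pages


# Three records with 3, 1 and 2 values: [a, b, c], [d], [e, f]
levels = [0, 1, 1, 0, 0, 1]

# A naive split after two values cuts the first record in half, so the
# second page begins with the tail of a record that started elsewhere.
naive_pages = [levels[:2], levels[2:]]

# Splitting only where the repetition level is 0 keeps records whole.
aligned_pages = split_into_pages(levels, max_values=2)
```

With the naive split, a reader that skips the first page cannot tell that the
leading value of the second page belongs to an earlier record. With the
aligned split, the number of zeros in each page is exactly the number of
records that page contains.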

We updated the parquet-rs reader to make this assumption in [2].

If others on this thread agree, I would be happy to draft a spec
clarification on this point.

Andrew





[1] https://github.com/apache/arrow-rs/issues/3680
[2] https://github.com/apache/arrow-rs/pull/4943



On Fri, May 10, 2024 at 1:15 PM Jan Finis <jpfi...@gmail.com> wrote:

> Hey Parquet devs,
>
> I so far thought that Parquet mandates that records start at page
> boundaries, i.e., at r-level 0, and we have relied on this fact in some
> places of our engine. That means, there cannot be any data page for a
> REPEATED column that starts at an r-level > 0, as this would mean that a
> record would be split between multiple pages.
>
> I also found the two comments in parquet.thrift:
>
>   /** Number of rows in this data page. which means pages change on record
>       boundaries (r = 0) **/
>   3: required i32 num_rows
>
>   /**
>    * Index within the RowGroup of the first row of the page; this means
>    * pages change on record boundaries (r = 0).
>    */
>   3: required i64 first_row_index
>
>
> These comments seem to imply that my understanding is correct. However,
> they are worded very weakly, not like a mandate but more like a "by the
> way" comment.
>
> I haven't found any other mention of r-levels and page boundaries in the
> parquet-format repo (maybe I missed them?).
>
> I recently noticed that pyarrow.parquet splits repeated fields over
> multiple pages, so it violates this. This triggers assertions in our
> engine, so I want to understand what's the right course of action here.
>
> So, can we please clarify:
> *Does Parquet mandate that pages need to start at r-level 0?*
>
>    - I.e., is a parquet file with a page that starts at an r-level > 0
>    ill-formed? I.e., is this a bug in pyarrow.parquet?
>    - Or can pages start at an r-level > 0? If so, then what is the
>    significance of the comments in parquet.thrift?
>
>
> Cheers,
> Jan
>
