I'd like to suggest that we recommend writers do not ever split records across pages, as frankly it is quite a surprising behavior. However, as this was ambiguous historically, readers should tolerate it in the absence of an offset index. This ensures backwards compatibility, whilst encouraging writers not to do this, and ensuring that offset indexes can be used to prune IO. This is the approach we have taken in parquet-rs [1].
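
To make the proposed reader behavior concrete, here is a minimal sketch. It is not the parquet-rs implementation: the `assemble_records` function, its `has_offset_index` flag, and the representation of a page as a list of `(repetition_level, value)` pairs are all hypothetical, chosen only to illustrate "tolerate split records unless an offset index is present".

```python
from typing import List, Sequence, Tuple

# Hypothetical model: a page is a sequence of (repetition_level, value) pairs.
Page = Sequence[Tuple[int, str]]


def assemble_records(pages: Sequence[Page], has_offset_index: bool) -> List[List[str]]:
    """Group the values of a repeated column into records.

    Without an offset index, a record that continues onto the next page is
    stitched together. With an offset index, a page that starts mid-record
    is rejected, because row-based page pruning would no longer be sound.
    """
    records: List[List[str]] = []
    for page_no, page in enumerate(pages):
        for pos, (rep_level, value) in enumerate(page):
            if rep_level == 0:
                # r = 0 always begins a new record.
                records.append([value])
                continue
            if pos == 0 and has_offset_index:
                raise ValueError(f"page {page_no} starts mid-record (r > 0)")
            if not records:
                raise ValueError("row group must begin with a new record (r = 0)")
            # r > 0 continues the current record, possibly from a prior page.
            records[-1].append(value)
    return records


pages = [[(0, "a"), (1, "b")], [(1, "c"), (0, "d")]]
print(assemble_records(pages, has_offset_index=False))
```

In this sketch the second page starts at r = 1, so the tolerant path merges `"c"` into the record begun on the first page, while the strict path raises.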

Kind Regards,

Raphael

[1]: https://github.com/apache/arrow-rs/pull/4943

On 21/05/2024 12:31, Gang Wu wrote:
BTW, it seems totally valid to create a page index for only a subset of
the columns. Does that mean columns without a page index can have
their records spanning more than one page?

Best,
Gang

On Tue, May 21, 2024 at 7:26 PM Gang Wu <ust...@gmail.com> wrote:

I would like to ask if it is valid to create only ColumnIndex but omit
OffsetIndex?
My answer is NO according to [1]. If agreed, my inclination is option 1.

[1]: https://github.com/apache/parquet-format/blob/079a2dff06e32b7d1ad8c9aa67f2e2128fb5ffa5/src/main/thrift/parquet.thrift#L1019-L1022



On Tue, May 21, 2024 at 6:31 PM wish maple <maplewish...@gmail.com> wrote:

I'm +1 on this. "Offset Index", "Page Index", and "Column Index or Offset
Index" all look good to me.

Best,
Xuwei Fu

On Tue, May 21, 2024 at 6:07 PM Andrew Lamb <andrewlam...@gmail.com> wrote:

mapleFU brought up an excellent question[1].

Upon further research, a "page index" seems to consist of an OffsetIndex
and ColumnIndex, but some writers may only write OffsetIndex (and not
ColumnIndex). See the discussion in [2].

Thus when we say "repeated fields must start at a page boundary if a page
index is present OR data-page V2 is present," does that mean:
1. an OffsetIndex is present
2. both an OffsetIndex and ColumnIndex are present
3. Something else

It seems to me that since an OffsetIndex is in terms of numbers of records,
if it were present that would require repetition_level=0 at page
boundaries (aka option 1).
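
A small sketch of why a row-based index implies this. The `first_row_indexes` helper below is hypothetical (it is not part of any Parquet library) and models each page as just its list of repetition levels. It derives a `first_row_index` per page by counting r = 0 entries, which only yields a well-defined row number when every page begins a new record.

```python
from typing import List, Sequence


def first_row_indexes(pages: Sequence[Sequence[int]]) -> List[int]:
    """Compute the row index of the first row of each page.

    Each page is given as its repetition levels. A level of 0 marks the
    start of a new row, so the first row of a page is the number of rows
    started on all earlier pages.
    """
    indexes: List[int] = []
    rows_seen = 0
    for page_no, rep_levels in enumerate(pages):
        if rep_levels and rep_levels[0] != 0:
            # The page begins in the middle of a row, so no single row
            # index describes where this page starts.
            raise ValueError(f"page {page_no} does not start at r = 0")
        indexes.append(rows_seen)
        rows_seen += sum(1 for r in rep_levels if r == 0)
    return indexes


print(first_row_indexes([[0, 1, 0], [0, 1, 1], [0]]))
```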

Thoughts?
Andrew


[1]: https://github.com/apache/parquet-format/pull/244#discussion_r1607878045
[2]: https://github.com/apache/parquet-format/pull/245

On Sun, May 19, 2024 at 7:18 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

I have created a PR[1] to the spec to try and encode this mailing list
conversation and avoid future confusion.  Please have a look and let me
know if it captures it correctly.

Thanks,
Andrew

[1]: https://github.com/apache/parquet-format/pull/244

On Wed, May 15, 2024 at 7:03 PM Julien Le Dem <jul...@apache.org> wrote:

+1 The semantics of a row group is that it contains rows and therefore
starts on R=0.
I generally echo Ed's sentiment here.

On Wed, May 15, 2024 at 8:01 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

Thank you all -- I have filed
https://issues.apache.org/jira/browse/PARQUET-2473 to track clarifying the
spec and will make a PR shortly.


On Sun, May 12, 2024 at 12:18 AM wish maple <maplewish...@gmail.com> wrote:

IMO when Page V2 is present or PageIndex is enabled, the boundaries
should be checked [1].

[1]: https://github.com/apache/arrow/blob/d10ebf055a393c94a693097db1dca08ff86745bd/cpp/src/parquet/column_writer.cc#L1235-L1237

On Sat, May 11, 2024 at 1:15 AM Jan Finis <jpfi...@gmail.com> wrote:

Hey Parquet devs,

I so far thought that Parquet mandates that records start at page
boundaries, i.e., at r-level 0, and we have relied on this fact in some
places of our engine. That means there cannot be any data page for a
REPEATED column that starts at an r-level > 0, as this would mean that a
record would be split between multiple pages.

I also found the two comments in parquet.thrift:

   /** Number of rows in this data page. which means pages change on record boundaries (r = 0) **/
   3: required i32 num_rows

   /**
    * Index within the RowGroup of the first row of the page; this means pages
    * change on record boundaries (r = 0).
    */
   3: required i64 first_row_index

These comments seem to imply that my understanding is correct. However,
they are worded very weakly, not like a mandate but more like a "by the
way" comment.

I haven't found any other mention of r-levels and page boundaries in the
parquet-format repo (maybe I missed them?).

I recently noticed that pyarrow.parquet splits repeated fields over
multiple pages, so it violates this. This triggers assertions in our
engine, so I want to understand what's the right course of action here.
So, can we please clarify:
*Does Parquet mandate that pages need to start at r-level 0?*

    - I.e., is a parquet file with a page that starts at an r-level > 0 ill
    formed? I.e., is this a bug in pyarrow.parquet?
    - Or can pages start at r-level > 0? If so, then what is the significance
    of the comments in parquet.thrift?


Cheers,
Jan
