In case anyone else is interested, I think the relevant parts of
parquet.thrift are [1] and [2].

I agree with Gang's interpretation that `num_nulls` is required, whereas
`null_count` in Statistics is optional.

Since Statistics is used in other places (e.g. ColumnMetaData [3]), I don't
think we could make `null_count` required there (not that you were
proposing this).
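
To illustrate the required vs optional distinction, here is a minimal Python
sketch. The dataclasses are hypothetical stand-ins for the Thrift structs
(only the fields discussed here are modelled), and the consistency check
assumes the two fields carry the same meaning, as this thread suggests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Statistics:
    # Optional in parquet.thrift, so a writer may leave it unset.
    null_count: Optional[int] = None


@dataclass
class DataPageHeaderV2:
    # Required in parquet.thrift, so it is always present.
    num_nulls: int
    # The statistics struct itself is optional in the page header.
    statistics: Optional[Statistics] = None


def page_null_count(header: DataPageHeaderV2) -> int:
    """Return the null count of a V2 data page.

    Relies on the required num_nulls field. If the optional
    statistics.null_count is also set, check that it agrees
    (an assumption of this sketch, not a rule stated in the spec).
    """
    stats = header.statistics
    if stats is not None and stats.null_count is not None:
        if stats.null_count != header.num_nulls:
            raise ValueError(
                f"num_nulls={header.num_nulls} disagrees with "
                f"statistics.null_count={stats.null_count}"
            )
    return header.num_nulls
```

A reader written this way never depends on the optional field being
present, which is why making it required would buy little for V2 pages.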

A bit off topic, but I think including Statistics in page headers is, in
general, of limited use: to read them you need to have already fetched the
page, so the amount of work that can be skipped is often pretty low by the
time you have the page header. A better way is to include the statistics
in the ColumnIndex [4], which can be fetched independently and then used to
skip many pages at once.
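
As a rough sketch of that idea in Python: given per-page min/max values and
the all-null flags from a ColumnIndex, a reader can decide which pages to
fetch before touching any page data. The function name and the use of
already-decoded integer values are assumptions made for illustration; in
the actual format the min/max values are stored as encoded binary.

```python
from typing import List, Sequence


def pages_to_read(
    min_values: Sequence[int],
    max_values: Sequence[int],
    null_pages: Sequence[bool],
    lo: int,
    hi: int,
) -> List[int]:
    """Return indexes of pages that may contain a value in [lo, hi].

    The three sequences have one entry per page, mirroring the
    per-page lists in a ColumnIndex (decoded to ints here).
    """
    keep = []
    for i, (mn, mx, all_null) in enumerate(
        zip(min_values, max_values, null_pages)
    ):
        if all_null:
            # A page holding only nulls cannot match a range predicate.
            continue
        if mx < lo or mn > hi:
            # The page's value range does not overlap [lo, hi].
            continue
        keep.append(i)
    return keep
```

For example, with three pages covering 0..9, 10..19 and 20..29, a predicate
of 12 <= x <= 21 only needs the second and third pages, and the first is
never fetched.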

Andrew


[1]:
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L724
[2]:
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L291
[3]:
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L912
[4]:
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L1163

On Thu, Oct 9, 2025 at 2:15 AM Gang Wu <[email protected]> wrote:

> I think you're right.
>
> The only difference is that statistics is optional but the field in the
> header is required.
>
> Best,
> Gang
>
> On Wed, Oct 8, 2025 at 8:12 PM Antoine Pitrou <[email protected]> wrote:
>
> >
> > Hello,
> >
> > It seems a V2 data page can have its number of nulls recorded in two
> > adjacent locations:
> > 1. the `num_nulls` field in `DataPageHeaderV2`
> > 2. the `null_count` field in `DataPageHeaderV2.statistics`
> >
> > Is this interpretation right? Or do those two fields actually have
> > different semantics?
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>
