Re: [DISCUSS](PARQUET-2249) Add nan_count to handle NaNs in statistics

Andrew Lamb Thu, 31 Jul 2025 10:47:42 -0700

I would also be in favor of starting a vote

On Thu, Jul 31, 2025 at 11:23 AM Jan Finis <jpfi...@gmail.com> wrote:


> As the author of both the IEEE754 total order
> <https://github.com/apache/parquet-format/pull/221> PR and the earlier PR
> that basically proposed `nan_count`
> <https://github.com/apache/parquet-format/pull/196>, my current vote would
> be for IEEE754 total order.
> Consequently, I would like to request a formal vote for the PR introducing
> IEEE754 total order (https://github.com/apache/parquet-format/pull/221),
> if that is possible.
>
> My Rationales:
>
>    - It's conceptually simpler. It's easier to explain. It's based on an
>    IEEE-standardized order predicate.
>    - There are already multiple implementations showing feasibility. This
>    will likely make the adoption quicker.
>    - It gives a defined order to every bit pattern and thus yields a total
>    order, mathematically speaking, which has value by itself. With NaN
>    counts, it was still undefined how different bit patterns of NaNs were
>    supposed to be ordered, whether NaN was allowed to have a sign bit,
>    etc., risking that different engines could come to different results
>    while filtering or sorting values within a file.
>    - It also solves sort order completely. With nan_counts only, it is
>    still undefined whether NaNs should be sorted before or after all values
>    (or both, depending on sign bit), so any file including NaNs could not
>    really leverage sort order without being ambiguous.
>    - It's less complex in Thrift. Having fields that only apply to a
>    handful of data types is somewhat odd. If every type did this, we would
>    have a plethora of non-generic fields in Thrift.
>    - Treating NaNs so specially is giving them attention they don't
>    deserve. Most data sets do not contain NaNs. If a use case really
>    requires them and needs filtering to ignore them, it can store NULL
>    instead, or encode them differently. I would prefer to optimize for the
>    average case rather than the special case here.
>    - The majority of the people discussing this so far seem to favor total
>    order.
>
> Cheers,
> Jan
>
> On Sat, Jul 26, 2025 at 5:38 PM Gang Wu <ust...@gmail.com> wrote:
>
> > Hi all,
> >
> > As this discussion has been open for more than two years, I’d like to bump
> > this thread again to share the progress and collect feedback.
> >
> > *Background*
> > • Today Parquet’s min/max stats and page index omit NaNs entirely.
> > • Engines can’t safely prune floating-point values because the statistics
> > tell them nothing about NaNs.
> > • Column index is disabled if any page contains only NaNs.
> >
> > There are two active proposals:
> >
> > *Proposal A - IEEE754TotalOrder* (from the PR [1])
> > • Define a new ColumnOrder to include +0, –0 and all NaN bit‐patterns.
> > • Stats and column index store NaNs if they appear.
> > • Three PoC implementations are ready: arrow-rs [2], duckdb [3] and
> > parquet-java [4].
> > • For more context on this approach, please refer to the discussion in [5].
> >
> > *Proposal B - add nan_count* (from a comment [6] to [1])
> > • Add `nan_count` to stats and a `nan_counts` list to column index.
> > • For all‐NaNs cases, write NaN to min/max and use nan_count to
> > distinguish.
> >
> > Both solutions have pros and cons, but both are far better than the status
> > quo. Please share your thoughts on the two proposals above, or suggest
> > better alternatives. We need to reach consensus on one proposal and move
> > forward.
> >
> > [1] https://github.com/apache/parquet-format/pull/221
> > [2] https://github.com/apache/arrow-rs/pull/7408
> > [3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
> > [4] https://github.com/apache/parquet-java/pull/3191
> > [5] https://github.com/apache/parquet-format/pull/196
> > [6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
> >
> > Best,
> > Gang
> >
> > On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <jpfi...@gmail.com> wrote:
> >
> > > Dear contributors,
> > >
> > > My PR has now gathered comments for a week and the gist of all open issues
> > > is the question of how to encode pages/column chunks that contain only
> > > NaNs. There are different suggestions and I don't see one common favorite
> > > yet.
> > >
> > > I have outlined three alternatives for handling these and I want us to
> > > reach a conclusion here, so I can update my PR accordingly and move on
> > > with it. As this is my first contribution to parquet, I don't know the
> > > decision processes here. Do we vote? Is there a single decision maker or
> > > a group of decision makers? *Please let me know how to come to a
> > > conclusion here; what are the next steps?*
> > >
> > > For reference, here are the three alternatives I pointed out. You can find
> > > a detailed description of their pros and cons in my comment:
> > > https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
> > >
> > > 1. My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
> > > 2. Adding `num_values` to the ColumnIndex, to make it symmetric with
> > > Statistics in pages & `ColumnMetaData` and to enable the computation
> > > `num_values - null_count - nan_count == 0`
> > > 3. Adding a `nan_pages` bool list to the column index, which indicates
> > > whether a page contains only NaNs
> > >
> > >
> > > Cheers
> > > Jan Finis
> > >
> >
>
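
For readers following the thread, the ordering behind Proposal A can be
illustrated with a minimal Python sketch. It is not taken from any of the
linked PRs; the helper names `from_bits` and `total_order_key` are
hypothetical, and the only assumption is the IEEE 754 binary64 layout. The
sketch maps each double to an integer key whose natural ordering should match
the IEEE 754 totalOrder predicate: negative NaNs, -inf, negative numbers, -0,
+0, positive numbers, +inf, positive NaNs.

```python
import struct


def from_bits(bits: int) -> float:
    """Hypothetical helper: build a double from a raw 64-bit pattern."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def total_order_key(x: float) -> int:
    """Hypothetical helper: integer sort key following IEEE 754 totalOrder."""
    # Reinterpret the 64 bits of the double as a signed 64-bit integer.
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    # Values with the sign bit set compare in reverse magnitude order as
    # integers, so flip their lower 63 bits to restore ascending order.
    if bits < 0:
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits


values = [1.0, from_bits(0x7FF8000000000000), -0.0, 0.0, float("-inf"), -1.0]
ordered = sorted(values, key=total_order_key)
```

With this key, -0.0 sorts directly before +0.0 and every NaN bit pattern gets
a fixed position, which is the property the quoted rationale describes as "a
defined order to every bit pattern".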

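The check named in alternative 2 of the 2023 message,
`num_values - null_count - nan_count == 0`, can likewise be sketched in
Python. The `PageCounts` type and the function name are hypothetical and are
not part of any Parquet API; the sketch only assumes that the three counts
are available per page.

```python
from dataclasses import dataclass


@dataclass
class PageCounts:
    """Hypothetical per-page counters, named after the fields in the thread."""
    num_values: int
    null_count: int
    nan_count: int


def has_no_ordered_values(page: PageCounts) -> bool:
    """True when every value in the page is either null or NaN."""
    # This is the expression quoted in the thread; when it holds, the page
    # contains no value that could serve as a regular min or max.
    return page.num_values - page.null_count - page.nan_count == 0
```

Note that the expression is also true for a page that holds only nulls, so a
reader that needs to tell the two cases apart would additionally compare
`nan_count` with zero.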