Re: [DISCUSS](PARQUET-2249) Add nan_count to handle NaNs in statistics

Jan Finis Thu, 31 Jul 2025 08:23:20 -0700

As the author of both the IEEE754 total order
<https://github.com/apache/parquet-format/pull/221> PR and the earlier PR
that basically proposed `nan_count`
<https://github.com/apache/parquet-format/pull/196>, my current vote would
be for IEEE754 total order.
Consequently, I would like to request a formal vote for the PR introducing
IEEE754 total order (https://github.com/apache/parquet-format/pull/221), if
that is possible.


My rationale:

   - It's conceptually simpler. It's easier to explain. It's based on an
   IEEE-standardized order predicate.
   - There are already multiple implementations showing feasibility. This
   will likely make the adoption quicker.
   - It gives a defined order to every bit pattern and thus yields a total
   order, mathematically speaking, which has value by itself. With NaN counts,
   it was still undefined how different bit patterns of NaNs were supposed to
   be ordered, whether NaN was allowed to have a sign bit, etc., risking that
   different engines could come to different results while filtering or
   sorting values within a file.
   - It also solves sort order completely. With `nan_count` only, it is
   still undefined whether NaNs should be sorted before or after all values
   (or both, depending on sign bit), so any file including NaNs could not
   really leverage sort order without being ambiguous.
   - It's less complex in Thrift. Having fields that only apply to a
   handful of data types is somewhat odd. If every type did this, we would
   have a plethora of non-generic fields in Thrift.
   - Treating NaNs so specially is giving them attention they don't
   deserve. Most data sets do not contain NaNs. If a use case really requires
   them and needs filtering to ignore them, they can store NULL instead,
   or encode them differently. I would prefer the average case over the
   special case here.
   - The majority of the people discussing this so far seem to favor total
   order.
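
To illustrate the total-order point above, here is a minimal Python sketch of
how an IEEE 754 totalOrder-style comparison for doubles can be derived from
the raw bit pattern. The names `total_order_key` and `total_order_min_max`
are hypothetical and chosen for illustration only; they are not taken from
PR 221 or from any of the PoC implementations.

```python
import math
import struct


def total_order_key(x: float) -> int:
    """Map a double to an integer whose natural order follows IEEE 754 totalOrder."""
    # Reinterpret the 64 bits of the double as a signed integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # For values with the sign bit set, flip the 63 magnitude bits so that
    # larger magnitudes sort lower. Non-negative values are left unchanged.
    if bits < 0:
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits


def total_order_min_max(values):
    """Return (min, max) of a non-empty sequence under the total order (illustrative helper)."""
    return min(values, key=total_order_key), max(values, key=total_order_key)


# NaNs with an explicit sign bit, so both ends of the order are exercised.
NEG_NAN = math.copysign(float("nan"), -1.0)
POS_NAN = math.copysign(float("nan"), 1.0)

SAMPLE = [POS_NAN, 1.0, -0.0, float("inf"), NEG_NAN, 0.0, -1.0, float("-inf")]
ORDERED = sorted(SAMPLE, key=total_order_key)
```

Under this key, NaNs with the sign bit set sort before -inf, -0 sorts before
+0, and NaNs with the sign bit clear sort after +inf, so every bit pattern has
a defined position and min/max are well defined even when NaNs are present.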

Cheers,
Jan

On Sat, Jul 26, 2025 at 5:38 PM Gang Wu <ust...@gmail.com> wrote:

> Hi all,
>
> As this discussion has been open for more than two years, I’d like to bump
> this thread again to give a progress update and collect feedback.
>
> *Background*
> • Today Parquet’s min/max stats and page index omit NaNs entirely.
> • Engines can’t safely prune floating-point values because the statistics
> tell them nothing about NaNs.
> • Column index is disabled if any page contains only NaNs.
>
> There are two active proposals:
>
> *Proposal A - IEEE754TotalOrder* (from the PR [1])
> • Define a new ColumnOrder to include +0, –0 and all NaN bit‐patterns.
> • Stats and column index store NaNs if they appear.
> • Three PoC impls are ready: arrow-rs [2], duckdb [3] and parquet-java [4].
> • For more context of this approach, please refer to discussion in [5].
>
> *Proposal B - add nan_count* (from a comment [6] to [1])
> • Add `nan_count` to stats and a `nan_counts` list to column index.
> • For all‐NaNs cases, write NaN to min/max and use nan_count to
> distinguish.
>
> Both solutions have pros and cons, but both are far better than the status
> quo. Please share your thoughts on the two proposals above, or suggest
> better alternatives. We need to reach consensus on one proposal and move
> forward.
>
> [1] https://github.com/apache/parquet-format/pull/221
> [2] https://github.com/apache/arrow-rs/pull/7408
> [3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
> [4] https://github.com/apache/parquet-java/pull/3191
> [5] https://github.com/apache/parquet-format/pull/196
> [6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
>
> Best,
> Gang
>
> On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <jpfi...@gmail.com> wrote:
>
> > Dear contributors,
> >
> > My PR has now gathered comments for a week, and the gist of all open
> > issues is the question of how to encode pages/column chunks that contain
> > only NaNs. There are different suggestions and I don't see one common
> > favorite yet.
> >
> > I have outlined three alternatives for handling these, and I want us to
> > reach a conclusion here so I can update my PR accordingly and move on
> > with it. As this is my first contribution to Parquet, I don't know the
> > decision processes here. Do we vote? Is there a single decision maker or
> > a group of decision makers? *Please let me know how to come to a
> > conclusion here; what are the next steps?*
> >
> > For reference, here are the three alternatives I pointed out. You can
> > find a detailed description of their pros and cons in my comment:
> > https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
> >
> > 1. My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
> > 2. Adding `num_values` to the ColumnIndex, to make it symmetric with
> > Statistics in pages & `ColumnMetaData` and to enable the computation
> > `num_values - null_count - nan_count == 0`
> > 3. Adding a `nan_pages` bool list to the column index, which indicates
> > whether a page contains only NaNs
> >
> >
> > Cheers
> > Jan Finis
> >
>
