Re: [DISCUSS](PARQUET-2249) Add nan_count to handle NaNs in statistics

Gijs Burghoorn Thu, 31 Jul 2025 14:23:49 -0700

Hello Jan and others,

First, let me preface this by saying I am quite new here, so I apologize if
there is a better way to bring up these concerns. I understand it is very
annoying to come in at the 11th hour and start raising a bunch of concerns,
but I would also like this to be done right. A colleague of mine brought up
some concerns and alternative approaches in the GitHub thread; I will raise
some of them here in response.


> Treating NaNs so specially is giving them attention they don't deserve.
> Most data sets do not contain NaNs. If a use case really requires them and
> needs filtering to ignore them, they can store NULL instead, or encode them
> differently. I would prefer the average case over the special case here.

NaNs are less common in the SQL world than in the DataFrame world, where
NaNs were long used to represent missing values. They still exist there,
with different canonical representations and different sign bits. I agree
this might not be semantically correct, but sadly that is the world we deal
with. NumPy and Numba have no missing-data functionality, so people use
NaNs there, and people definitely rely on that in their analytical
dataflows. Another point brought up in the GH discussion was "what about
infinity? You could argue that having infinity in statistics is similarly
unuseful as it's too wide of a bound". I would argue that infinity is very
different, as there is no dispute about the ordering or bit pattern of
infinity. Everyone agrees that `min(1.0, inf, -inf) == -inf`, and each
infinity has only a single bit pattern.
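
To make the bit-pattern point concrete, here is a minimal Python sketch. The
helpers `bits` and `from_bits` are made up for illustration, and the two NaN
patterns are just arbitrary examples of a sign-bit and payload difference; it
assumes a platform that preserves quiet-NaN payloads through `struct`.

```python
import math
import struct

def bits(x: float) -> int:
    """Raw 64-bit pattern of a Python float (IEEE 754 binary64)."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def from_bits(b: int) -> float:
    """Build a float from a raw 64-bit pattern (illustrative helper)."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]

# Two quiet NaNs that differ in sign bit and payload.
pos_nan = from_bits(0x7FF8000000000000)
neg_nan = from_bits(0xFFF8000000000001)

both_nan = math.isnan(pos_nan) and math.isnan(neg_nan)
distinct_patterns = bits(pos_nan) != bits(neg_nan)

# Each infinity, by contrast, has exactly one bit pattern.
pos_inf_bits = bits(math.inf)
neg_inf_bits = bits(-math.inf)

# The ordering of infinities is uncontroversial.
smallest = min(1.0, math.inf, -math.inf)
```

Both values are NaN as far as `math.isnan` is concerned, yet a writer that
records "the" NaN in statistics has to pick one of many patterns.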

> It gives a defined order to every bit pattern and thus yields a total
> order, mathematically speaking, which has value by itself. With NaN counts,
> it was still undefined how different bit patterns of NaNs were supposed to
> be ordered, whether NaN was allowed to have a sign bit, etc., risking that
> different engines could come to different results while filtering or
> sorting values within a file.

Since the proposal states as a goal that it should work "regardless of how
they order NaN w.r.t. other values", this statement feels out of place to
me. Most hardware and most people don't care about total ordering, and
having to take it into account while filtering on statistics seems like
preferring the special case over the common case. Almost no one filters for
specific NaN bit patterns. SQL engines that don't use IEEE total ordering
as their default ordering for floats will also need more special handling
for this.
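
For readers less familiar with the difference, the following Python sketch
contrasts ordinary float comparison with an IEEE 754 totalOrder-style sort.
The names `total_order_key` and `from_bits` are my own for illustration; the
key uses the well-known trick of flipping the non-sign bits of negative
values, and this is not taken from any of the PoC implementations.

```python
import math
import struct

def from_bits(b: int) -> float:
    """Build a binary64 float from a raw 64-bit pattern (illustrative helper)."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def total_order_key(x: float) -> int:
    """Integer key whose natural order follows IEEE 754 totalOrder for binary64."""
    b = struct.unpack("<q", struct.pack("<d", x))[0]  # signed 64-bit view
    # Negative floats: flip the 63 non-sign bits so larger magnitudes sort lower.
    return b ^ 0x7FFFFFFFFFFFFFFF if b < 0 else b

pos_nan = from_bits(0x7FF8000000000000)
neg_nan = from_bits(0xFFF8000000000000)
values = [pos_nan, 1.0, -math.inf, neg_nan, 0.0, -0.0, math.inf]

# Total order: -NaN < -inf < -0.0 < +0.0 < 1.0 < +inf < +NaN
ordered = sorted(values, key=total_order_key)

# Ordinary comparison, by contrast, leaves NaN unordered.
nan_unordered = (
    not (pos_nan < 1.0) and not (pos_nan > 1.0) and pos_nan != pos_nan
)
```

An engine whose default float comparison is the ordinary one has to carry
something like this key around purely for reading statistics.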

I also agree with my colleague that adopting an approach that gets only 50%
of the way there will make the barrier to later improving it to what it
actually should be much higher.

As for ways forward, I propose merging the `nan_count` and `sort ordering`
proposals into a single proposal, as they are linked, and moving forward
with one without knowing what will happen to the other seems unwise. From a
Polars perspective, having a `nan_count` and defining what happens to the
`min` and `max` statistics when a page contains only NaNs is enough to allow
for all predicate filtering. I think, but correct me if I am wrong, that
this is also enough for all SQL engines that don't use total ordering. But
if you want to be impartial to the engine's floating-point ordering and
allow engines with total ordering to do inequality filters when
`nan_count > 0`, you would need a `positive_nan_count` and a
`negative_nan_count`. I understand the downside of added Thrift complexity,
but introducing another sort order also adds complexity, just in a
different place.
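
As a rough illustration of the filtering argument, here is a Python sketch of
page pruning for a predicate `x > c`. Everything in it is hypothetical: the
`PageStats` class and its field layout are not the Thrift definitions, the
all-NaN page is modelled as min = max = NaN as in the `nan_count` proposal,
nulls are ignored, and `c` is assumed not to be NaN.

```python
import math
from dataclasses import dataclass

@dataclass
class PageStats:
    # Hypothetical in-memory view of per-page statistics.
    min: float      # min over non-NaN values; NaN if the page holds only NaNs
    max: float      # max over non-NaN values; NaN if the page holds only NaNs
    nan_count: int

def may_match_greater_than(
    stats: PageStats, c: float, nan_is_largest: bool
) -> bool:
    """Return False only if no value in the page can satisfy `x > c`."""
    if nan_is_largest and stats.nan_count > 0:
        # Engine orders NaN above every other value, so any NaN matches.
        return True
    if math.isnan(stats.max):
        # Only NaNs in the page, and under IEEE comparison NaN > c is False.
        return False
    return stats.max > c

mixed_page = PageStats(min=1.0, max=5.0, nan_count=2)
all_nan_page = PageStats(min=math.nan, max=math.nan, nan_count=3)

# An engine with IEEE comparison semantics can skip both pages for x > 10.
skip_mixed = not may_match_greater_than(mixed_page, 10.0, nan_is_largest=False)
skip_all_nan = not may_match_greater_than(all_nan_page, 10.0, nan_is_largest=False)
```

The same statistics serve an engine that sorts NaN last (pass
`nan_is_largest=True`), which then has to keep both pages. What this sketch
cannot express is an engine that places negative NaNs below all values, which
is where the split counts would come in.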

I would really like to see this move forward, and I hope these concerns help
steer it towards a solution that works for everyone.

Kind regards,
Gijs


On Thu, Jul 31, 2025 at 7:46 PM Andrew Lamb <[email protected]> wrote:

> I would also be in favor of starting a vote
>
> On Thu, Jul 31, 2025 at 11:23 AM Jan Finis <[email protected]> wrote:
>
> > As the author of both the IEEE754 total order
> > <https://github.com/apache/parquet-format/pull/221> PR and the earlier PR
> > that basically proposed `nan_count`
> > <https://github.com/apache/parquet-format/pull/196>, my current vote
> > would be for IEEE754 total order.
> > Consequently, I would like to request a formal vote for the PR
> > introducing IEEE754 total order
> > (https://github.com/apache/parquet-format/pull/221), if that is possible.
> >
> > My Rationales:
> >
> >    - It's conceptually simpler. It's easier to explain. It's based on an
> >    IEEE-standardized order predicate.
> >    - There are already multiple implementations showing feasibility. This
> >    will likely make the adoption quicker.
> >    - It gives a defined order to every bit pattern and thus yields a
> >    total order, mathematically speaking, which has value by itself. With
> >    NaN counts, it was still undefined how different bit patterns of NaNs
> >    were supposed to be ordered, whether NaN was allowed to have a sign
> >    bit, etc., risking that different engines could come to different
> >    results while filtering or sorting values within a file.
> >    - It also solves sort order completely. With nan_counts only, it is
> >    still undefined whether nans should be sorted before or after all
> >    values (or both, depending on sign bit), so any file including NaNs
> >    could not really leverage sort order without being ambiguous.
> >    - It's less complex in thrift. Having fields that only apply to a
> >    handful of data types is somehow weird. If every type did this, we
> >    would have a plethora of non-generic fields in thrift.
> >    - Treating NaNs so specially is giving them attention they don't
> >    deserve. Most data sets do not contain NaNs. If a use case really
> >    requires them and needs filtering to ignore them, they can store NULL
> >    instead, or encode them differently. I would prefer the average case
> >    over the special case here.
> >    - The majority of the people discussing this so far seem to favor
> >    total order.
> >
> > Cheers,
> > Jan
> >
> > On Sat, Jul 26, 2025 at 5:38 PM Gang Wu <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > As this discussion has been open for more than two years, I’d like to
> > > bump up this thread again to update the progress and collect feedback.
> > >
> > > *Background*
> > > • Today Parquet’s min/max stats and page index omit NaNs entirely.
> > > • Engines can’t safely prune floating values because they know nothing
> > > on NaNs.
> > > • Column index is disabled if any page contains only NaNs.
> > >
> > > There are two active proposals as below:
> > >
> > > *Proposal A - IEEE754TotalOrder* (from the PR [1])
> > > • Define a new ColumnOrder to include +0, –0 and all NaN bit‐patterns.
> > > • Stats and column index store NaNs if they appear.
> > > • Three PoC impls are ready: arrow-rs [2], duckdb [3] and parquet-java
> > > [4].
> > > • For more context of this approach, please refer to discussion in [5].
> > >
> > > *Proposal B - add nan_count* (from a comment [6] to [1])
> > > • Add `nan_count` to stats and a `nan_counts` list to column index.
> > > • For all‐NaNs cases, write NaN to min/max and use nan_count to
> > > distinguish.
> > >
> > > Both solutions have pros and cons but are way better than the status
> > > quo today.
> > > Please share your thoughts on the two proposals above, or maybe come up
> > > with better alternatives. We need consensus on one proposal and move
> > > forward.
> > >
> > > [1] https://github.com/apache/parquet-format/pull/221
> > > [2] https://github.com/apache/arrow-rs/pull/7408
> > > [3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
> > > [4] https://github.com/apache/parquet-java/pull/3191
> > > [5] https://github.com/apache/parquet-format/pull/196
> > > [6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
> > >
> > > Best,
> > > Gang
> > >
> > > On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <[email protected]> wrote:
> > >
> > > > Dear contributors,
> > > >
> > > > My PR has now gathered comments for a week and the gist of all open
> > > > issues is the question of how to encode pages/column chunks that
> > > > contain only NaNs. There are different suggestions and I don't see one
> > > > common favorite yet.
> > > >
> > > > I have outlined three alternatives of how we can handle these and I
> > > > want us to reach a conclusion here, so I can update my PR accordingly
> > > > and move on with it. As this is my first contribution to parquet, I
> > > > don't know the decision processes here. Do we vote? Is there a single
> > > > or group of decision makers? *Please let me know how to come to a
> > > > conclusion here; what are the next steps?*
> > > >
> > > > For reference, here are the three alternatives I pointed out. You can
> > > > find detailed description of their PROs and CONs in my comment:
> > > > https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
> > > >
> > > > 1. My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
> > > > 2. Adding `num_values` to the ColumnIndex, to make it symmetric with
> > > > Statistics in pages & `ColumnMetaData` and to enable the computation
> > > > `num_values - null_count - nan_count == 0`
> > > > 3. Adding a `nan_pages` bool list to the column index, which indicates
> > > > whether a page contains only NaNs
> > > >
> > > >
> > > > Cheers
> > > > Jan Finis
> > > >
> > >
> >
>
