Re: [DISCUSS](PARQUET-2249) Add nan_count to handle NaNs in statistics

Jan Finis Sat, 09 Aug 2025 06:07:53 -0700

Hey Ryan,

Thanks for chiming in. First of all, to make it quick: Yes, the solution of
having nan_counts *and* total order, which was brought up multiple times,
does work and solves more cases than either one alone.


> I strongly prefer continuing to discuss the merits of these approaches
> rather than trying to decide with a vote.


In theory, I agree that it isn't good to silence a discussion by just
voting for one possible solution and technical issues should be discussed.
However, please note that we have been circling on this for over two years
now, including an extended discussion that brought up all arguments
multiple times. This is in stark contrast to the
speed with which you guys work on the Iceberg spec, for example. There, you
also do not discuss the merits of various solutions for multiple years. You
just pick one and merge it after a *reasonable* time of discussion.
If you had the speed we currently have here, nothing would get done. Thus,
I see this as a clear case of *"the perfect is the enemy of the good"*.
Yes, we can continue looking for the perfect solution,
but that will likely lead to keeping us at the status quo, which is the
worst of them all.

That being said, I'm also happy to create a PR which does both total order
and NaN counts; after all, I just want the issue solved and all these
solutions are better than the status quo.

*As this has now been suggested by at least three people, I guess it's worth
doing, so here you go: https://github.com/apache/parquet-format/pull/514*

With this, we should have PRs covering most of the solution space.
(I'm refusing to create a PR with negative and positive nan_counts;
nan_counts + total order has to suffice; the complexity madness has to stop
somewhere)
I still believe that there were a number of people who already found
nan_counts too complex and therefore wanted IEEE total order, and these
people may not like taking on extra complexity,
but let's see; maybe some have changed their opinion in the meantime.


*Given all this, we can also first do an informal vote where everyone can
vote for their favorite of the three. Maybe a clear favorite
will emerge and then we can vote on this one.*

But of course, we can also take some weeks to discuss the three solutions,
now that we have PRs for all of them. I just hope this won't make us
continue for another two years, or end in an infinite stalemate where each
solution is vetoed by a PMC member.
(Sorry for becoming a bit cynical here; I have just spent way too much time
of my life with double statistics at this point ;) ...)


Cheers,
Jan

On Fri, Aug 8, 2025 at 11:38 PM Ryan Blue <[email protected]> wrote:

> Regarding the process for this, I strongly prefer continuing to discuss the
> merits of these approaches rather than trying to decide with a vote. I
> don't think it is a good practice to use a vote to decide on a technical
> direction. There are very few situations that warrant it and I don't think
> that this is one of them. While this issue has been open for a long time,
> that appears to be the result of it not being anyone's top priority rather
> than indecision.
>
> For the technical merits of these approaches, I think that we can find a
> middle ground. I agree with Jan that when working with sorted values, we
> need to know how NaN values were handled and that requires using a
> well-defined order that includes NaN and its variations (because we should
> not normalize). Using NaN count is not sufficient for ordering rows.
>
> Gijs also brings up good points about how NaN values show up in actual
> datasets: not just when used in place of null, but also as the result of
> normal calculations on abnormal data, like `sqrt(-4.0)` or `log(-1.0)`.
> Both of those present problems when mixed with valid data because of the
> stats "poisoning" problem, where the range of valid data is usable until a
> single NaN is mixed in.
>
> Another issue is that NaN is error-prone because "regular" comparison is
> always false:
> ```
> Math.log(-1.0) >= 2 => FALSE
> Math.log(-1.0) < 2 => FALSE
> 2 > Math.log(-1.0) => FALSE
> ```
>
> As a result, Iceberg doesn't trust NaN values as either lower or upper
> bounds because we don't want to go back to the code that produced the value
> to see what the comparison order was to determine whether NaN values go
> before or after others.
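A minimal Python sketch of why the comparison order matters, for illustration only and not taken from any engine. The helper `naive_min` is hypothetical. Python's `math.log(-1.0)` raises an error instead of returning NaN, so the sketch uses `float("nan")` directly.

```python
import math

nan = float("nan")

# Every ordered comparison involving NaN is False, whichever side NaN is on.
assert not (nan >= 2.0)
assert not (nan < 2.0)
assert not (2.0 > nan)


def naive_min(values):
    """Running minimum using only `<`, as a simple stats writer might do."""
    result = values[0]
    for v in values[1:]:
        if v < result:
            result = v
    return result


# The same three values give different "minimums" depending on position.
first = naive_min([nan, 1.0, 2.0])   # NaN comes first and is never replaced
second = naive_min([1.0, nan, 2.0])  # NaN comes later and is never selected
print(math.isnan(first), second)
```

A reader that sees NaN as a bound therefore cannot tell what range the other values cover without knowing how the writer compared them.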
>
> Total order solves the second issue in theory, but regular comparison is
> prevalent and not obvious to developers. And it also doesn't help when NaN
> is used instead of null. So using total order is not sufficient for data
> skipping.
>
> I think the right compromise is to use `min`, `max`, and `nan_count` for
> data skipping stats (where min and max cannot be NaN) and total ordering
> for sorting values. That satisfies the data skipping use cases and also
> gives us an ordering of unaltered values that we can reason about.
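A rough Python sketch of how such data-skipping stats could be computed and used, assuming min and max are simply absent when a page has no non-NaN values. `PageStats`, `compute_stats` and `may_contain_greater_than` are hypothetical names for illustration, not the Parquet Thrift definitions.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PageStats:
    """Hypothetical per-page statistics; not the Parquet Thrift struct."""
    min: Optional[float]  # smallest non-NaN value, None if there is none
    max: Optional[float]  # largest non-NaN value, None if there is none
    nan_count: int


def compute_stats(values: Sequence[float]) -> PageStats:
    """Compute min/max over non-NaN values and count NaNs separately."""
    ordinary = [v for v in values if not math.isnan(v)]
    return PageStats(
        min=min(ordinary) if ordinary else None,
        max=max(ordinary) if ordinary else None,
        nan_count=len(values) - len(ordinary),
    )


def may_contain_greater_than(stats: PageStats, threshold: float) -> bool:
    """Return False only if no value in the page can satisfy v > threshold.

    Assumes the engine uses regular comparison, under which NaN never
    matches, so nan_count alone does not keep the page alive.
    """
    return stats.max is not None and stats.max > threshold
```

With these assumptions, a single NaN no longer poisons the range: `compute_stats([1.0, float("nan"), 3.0])` still reports a usable min of 1.0 and max of 3.0, with `nan_count` of 1.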
>
> Does anyone think that doesn't work?
>
> Ryan
>
> On Fri, Aug 1, 2025 at 8:57 AM Gang Wu <[email protected]> wrote:
>
> > Thanks Jan for your endless effort on this!
> >
> > I'm in favor of simplicity and generality. I think we have already debated
> > `nan_count` a lot in [1], and [2] is the reflection of those discussions.
> > Therefore, I am inclined to start a vote for [2] unless there is a
> > significantly better proposal.
> >
> > I would suggest that everyone interested in this discussion attend the
> > scheduled sync on Aug 6th (detailed below) to spread the word to the
> > broader community. If we can get a consensus on [2], I can help start the
> > vote and move forward.
> >
> > *Apache Parquet Community Sync Wednesday, August 6 · 10:00 – 11:00am *
> > *Time zone: America/Los_Angeles*
> > *Google Meet joining info Video call link:
> > https://meet.google.com/bhe-rvan-qjk
> > <https://meet.google.com/bhe-rvan-qjk> *
> >
> > [1] https://github.com/apache/parquet-format/pull/196
> > [2] https://github.com/apache/parquet-format/pull/221
> >
> > Best,
> > Gang
> >
> >
> > On Fri, Aug 1, 2025 at 6:16 PM Jan Finis <[email protected]> wrote:
> >
> > > Hi Gijs,
> > >
> > > Thank you for bringing up concrete points, I'm happy to discuss them in
> > > detail.
> > >
> > > > NaNs are less common in the SQL world than in the DataFrame world where
> > > > NaNs were used for a long time to represent missing values.
> > >
> > >
> > > You could transcode between NULL and NaN when reading from and writing to
> > > Parquet. You basically mention yourself that NaNs were used for missing
> > > values, i.e., what is commonly a NULL, which wasn't available. So,
> > > semantically, transcoding to NULL would even be the sane thing to do. Yes,
> > > that will cost you some cycles, but it should be a rather lightweight
> > > operation in comparison to most other operations, so I would argue that it
> > > won't totally ruin your performance. Similarly, why should Parquet play
> > > along with a "hack" that was done in other frameworks due to shortcomings
> > > of those frameworks? So from a philosophical point of view, I think
> > > supporting NaNs better is the wrong thing to do. Rather, we should be a
> > > forcing function to align others to better behavior, so applying a bit of
> > > force might in the long run make people use NULLs also in DataFrames.
> > >
> > > Of course, your argument also goes in the direction of pragmatism: If a
> > > large part of the data science world uses NaNs to encode missing values,
> > > then maybe Parquet should accept this de-facto standard rather than
> > > fighting it. That is indeed a valid point. The weight of it is debatable
> > > and my personal conclusion is that it's still not worth it, as you can
> > > transcode between NULLs and NaNs, but I do agree with its validity.
> > >
> > >
> > > > Since the proposal phrases it as a goal to work "regardless of how they
> > > > order NaN w.r.t. other values" this statement feels out-of-place to me.
> > > > Most hardware and most people don't care about total ordering and needing
> > > > to take it into account while filtering using statistics seems like
> > > > preferring the special case instead of the common case. Almost no one
> > > > filters for specific NaN value bit-patterns. SQL engines that don't have
> > > > IEEE total ordering as their default ordering for floats will also need to
> > > > do more special handling for this.
> > >
> > >
> > > I disagree with the conclusion this statement draws. The current behavior,
> > > and nan_counts without total ordering, pose a real problem here, even for
> > > engines that don't care about bit patterns. I do agree that most database
> > > engines, including the one I'm working on, do not care about bit patterns
> > > and/or sign bits. However, how can our database engine know whether the
> > > writer of a Parquet file saw it the same way? It can't. Therefore, it
> > > cannot know whether a writer, for example, ordered NaNs before or after all
> > > other numbers, or maybe ordered them by sign bit. So, if our database
> > > engine now sees a float column in sorting columns, it cannot apply any
> > > optimization without a lot of special casing, as it doesn't know whether
> > > NaNs will be before all other values, after all other values, or maybe
> > > both, depending on sign bit. It could apply contrived logic that tries to
> > > infer where NaNs were placed from the NaN counts of the first and last
> > > page, but doing so will be a lot of ugly code that also feels to be in the
> > > wrong place. I.e., I don't want to need to load pages or the page index,
> > > just to reason about a sort order.
> > >
> > > > SQL engines that don't have
> > > > IEEE total ordering as their default ordering for floats will also need to
> > > > do more special handling for this.
> > >
> > >
> > > This code, which I would indeed need to write for our engine, is
> > > comparatively trivial. Simply choose the largest possible bit pattern as
> > > the comparison value for upper-bound filtering on NaN, and the smallest
> > > possible bit pattern for lower bounds. It's not more than a few lines of
> > > code that check whether a filter is NaN and then replace its value with the
> > > highest/lowest NaN bit pattern. It is similarly trivial to the special
> > > casing I need to do with nan_counts, and it is way more trivial than the
> > > extra code I would need to write for sorting columns, as described above.
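A sketch in Python of what such special handling might look like for 64-bit floats. This is one possible reading of the description above, not code from any engine or from the PRs. The constants and the helper `filter_bound_bits` are hypothetical, and `total_order_key` is just one way to get an integer that sorts like the IEEE 754 totalOrder predicate.

```python
import math
import struct

# Assumed extreme NaN bit patterns for 64-bit floats under IEEE 754 total order.
LARGEST_NAN_BITS = 0x7FFFFFFFFFFFFFFF   # sign bit clear, all other bits set
SMALLEST_NAN_BITS = 0xFFFFFFFFFFFFFFFF  # sign bit set, all other bits set


def float_to_bits(value: float) -> int:
    """Return the raw 64-bit pattern of a double as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def total_order_key(bits: int) -> int:
    """Map a bit pattern to an integer that sorts like IEEE 754 totalOrder."""
    magnitude = bits & 0x7FFFFFFFFFFFFFFF
    if bits >> 63:
        # Negative sign: a larger magnitude sorts earlier, and -0 before +0.
        return -magnitude - 1
    return magnitude


def filter_bound_bits(value: float, upper: bool) -> int:
    """Hypothetical helper: pick the bit pattern to compare against stats."""
    if math.isnan(value):
        return LARGEST_NAN_BITS if upper else SMALLEST_NAN_BITS
    return float_to_bits(value)
```

Under these assumptions, an engine that treats all NaNs alike compares `total_order_key(filter_bound_bits(x, upper=True))` against the key of a page's max, so every NaN bit pattern a writer may have stored falls inside the bound.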
> > >
> > > > From a Polars perspective, having a `nan_count` and defining what
> > > > happens to the `min` and `max` statistics when a page contains only NaNs is
> > > > enough to allow for all predicate filtering. I think, but correct me if I
> > > > am wrong, this is also enough for all SQL engines that don't use total
> > > > ordering.
> > >
> > >
> > > It's not fully enough, as described above. Sorting columns would still not
> > > work properly.
> > >
> > > > As for ways forward, I propose merging the `nan_count` and `sort ordering`
> > > > proposals into one to make one proposal
> > >
> > >
> > > Note that the initial reason for proposing IEEE total order was that people
> > > in the discussion threads found nan_counts to be too complex and too much
> > > of an undeserving special case (re-read the discussion in the initial PR
> > > <https://github.com/apache/parquet-format/pull/196> to see the
> > > rationales).
> > > So merging both together would go totally against the spirit of why IEEE
> > > total order was proposed. While it has further upsides, the main reason was
> > > indeed to *not have* nan_counts. If the proposal now even went to
> > > positive and negative nan counts (i.e., even more complexity), this would
> > > go 180 degrees in the opposite direction of why people wanted total order
> > > in the first place.
> > >
> > > Cheers,
> > > Jan
> > >
> > > On Thu, Jul 31, 2025 at 11:23 PM Gijs Burghoorn
> > > <[email protected]> wrote:
> > >
> > > > Hello Jan and others,
> > > >
> > > > First, let me preface by saying I am quite new here. So I apologize if
> > > > there is some other better way to bring up these concerns. I understand it
> > > > is very annoying to come in at the 11th hour and start bringing up a bunch
> > > > of concerns, but I would also like this to be done right. A colleague of
> > > > mine brought up some concerns and alternative approaches in the GitHub
> > > > thread; I will file some of the concerns here as a response.
> > > >
> > > > > Treating NaNs so specially is giving them attention they don't deserve.
> > > > > Most data sets do not contain NaNs. If a use case really requires them and
> > > > > needs filtering to ignore them, they can store NULL instead, or encode them
> > > > > differently. I would prefer the average case over the special case here.
> > > >
> > > > NaNs are less common in the SQL world than in the DataFrame world where
> > > > NaNs were used for a long time to represent missing values. They still
> > > > exist with different canonical representations and different sign bits. I
> > > > agree it might not be correct semantically, but sadly that is the world we
> > > > deal with. NumPy and Numba do not have missing data functionality, people
> > > > use NaNs there, and people definitely use that in their analytical
> > > > dataflows. Another point that was brought up in the GH discussion was "what
> > > > about infinity? You could argue that having infinity in statistics is
> > > > similarly unuseful as it's too wide of a bound". I would argue that
> > > > infinity is very different as there is no discussion on what the ordering
> > > > or pattern of infinity is. Everyone agrees that `min(1.0, inf, -inf) ==
> > > > -inf` and each infinity only has a single bit pattern.
> > > >
> > > > > It gives a defined order to every bit pattern and thus yields a total
> > > > > order, mathematically speaking, which has value by itself. With NaN counts,
> > > > > it was still undefined how different bit patterns of NaNs were supposed to
> > > > > be ordered, whether NaN was allowed to have a sign bit, etc., risking that
> > > > > different engines could come to different results while filtering or
> > > > > sorting values within a file.
> > > >
> > > > Since the proposal phrases it as a goal to work "regardless of how they
> > > > order NaN w.r.t. other values" this statement feels out-of-place to me.
> > > > Most hardware and most people don't care about total ordering and needing
> > > > to take it into account while filtering using statistics seems like
> > > > preferring the special case instead of the common case. Almost no one
> > > > filters for specific NaN value bit-patterns. SQL engines that don't have
> > > > IEEE total ordering as their default ordering for floats will also need to
> > > > do more special handling for this.
> > > >
> > > > I also agree with my colleague that doing an approach that is 50% of the
> > > > way there will make the barrier to improving it to what it actually should
> > > > be later on much higher.
> > > >
> > > > As for ways forward, I propose merging the `nan_count` and `sort ordering`
> > > > proposals into one to make one proposal, as they are linked together, and
> > > > moving forward with one without knowing what will happen to the other seems
> > > > unwise. From a Polars perspective, having a `nan_count` and defining what
> > > > happens to the `min` and `max` statistics when a page contains only NaNs is
> > > > enough to allow for all predicate filtering. I think, but correct me if I
> > > > am wrong, this is also enough for all SQL engines that don't use total
> > > > ordering. But if you want to be impartial to the engine's floating-point
> > > > ordering and allow engines with total ordering to do inequality filters
> > > > when `nan_count > 0` you would need a `positive_nan_count` and a
> > > > `negative_nan_count`. I understand the downside with Thrift complexity, but
> > > > introducing another sort order is also adding complexity just in a
> > > > different place.
> > > >
> > > > I would really like to see this move forward, so I hope these concerns help
> > > > move it forward towards a solution that works for everyone.
> > > >
> > > > Kind regards,
> > > > Gijs
> > > >
> > > >
> > > > On Thu, Jul 31, 2025 at 7:46 PM Andrew Lamb <[email protected]>
> > > > wrote:
> > > >
> > > > > I would also be in favor of starting a vote
> > > > >
> > > > > On Thu, Jul 31, 2025 at 11:23 AM Jan Finis <[email protected]> wrote:
> > > > >
> > > > > > As the author of both the IEEE754 total order
> > > > > > <https://github.com/apache/parquet-format/pull/221> PR and the earlier PR
> > > > > > that basically proposed `nan_count`
> > > > > > <https://github.com/apache/parquet-format/pull/196>, my current vote would
> > > > > > be for IEEE754 total order.
> > > > > > Consequently, I would like to request a formal vote for the PR introducing
> > > > > > IEEE754 total order (https://github.com/apache/parquet-format/pull/221), if
> > > > > > that is possible.
> > > > > >
> > > > > > My Rationales:
> > > > > >
> > > > > >    - It's conceptually simpler. It's easier to explain. It's based on an
> > > > > >    IEEE-standardized order predicate.
> > > > > >    - There are already multiple implementations showing feasibility. This
> > > > > >    will likely make the adoption quicker.
> > > > > >    - It gives a defined order to every bit pattern and thus yields a total
> > > > > >    order, mathematically speaking, which has value by itself. With NaN
> > > > > >    counts, it was still undefined how different bit patterns of NaNs were
> > > > > >    supposed to be ordered, whether NaN was allowed to have a sign bit, etc.,
> > > > > >    risking that different engines could come to different results while
> > > > > >    filtering or sorting values within a file.
> > > > > >    - It also solves sort order completely. With nan_counts only, it is
> > > > > >    still undefined whether NaNs should be sorted before or after all values
> > > > > >    (or both, depending on sign bit), so any file including NaNs could not
> > > > > >    really leverage sort order without being ambiguous.
> > > > > >    - It's less complex in Thrift. Having fields that only apply to a
> > > > > >    handful of data types is somewhat weird. If every type did this, we would
> > > > > >    have a plethora of non-generic fields in Thrift.
> > > > > >    - Treating NaNs so specially is giving them attention they don't
> > > > > >    deserve. Most data sets do not contain NaNs. If a use case really requires
> > > > > >    them and needs filtering to ignore them, they can store NULL instead,
> > > > > >    or encode them differently. I would prefer the average case over the
> > > > > >    special case here.
> > > > > >    - The majority of the people discussing this so far seem to favor total
> > > > > >    order.
> > > > > >
> > > > > > Cheers,
> > > > > > Jan
> > > > > >
> > > > > > On Sat, Jul 26, 2025 at 5:38 PM Gang Wu <[email protected]> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > As this discussion has been open for more than two years, I’d like to bump
> > > > > > > up this thread again to update the progress and collect feedback.
> > > > > > >
> > > > > > > *Background*
> > > > > > > • Today Parquet’s min/max stats and page index omit NaNs entirely.
> > > > > > > • Engines can’t safely prune floating values because they know nothing
> > > > > > > about NaNs.
> > > > > > > • Column index is disabled if any page contains only NaNs.
> > > > > > >
> > > > > > > There are two active proposals as below:
> > > > > > >
> > > > > > > *Proposal A - IEEE754TotalOrder* (from the PR [1])
> > > > > > > • Define a new ColumnOrder to include +0, -0 and all NaN bit-patterns.
> > > > > > > • Stats and column index store NaNs if they appear.
> > > > > > > • Three PoC impls are ready: arrow-rs [2], duckdb [3] and parquet-java [4].
> > > > > > > • For more context of this approach, please refer to discussion in [5].
> > > > > > >
> > > > > > > *Proposal B - add nan_count* (from a comment [6] to [1])
> > > > > > > • Add `nan_count` to stats and a `nan_counts` list to column index.
> > > > > > > • For all-NaN cases, write NaN to min/max and use nan_count to
> > > > > > > distinguish.
> > > > > > >
> > > > > > > Both solutions have pros and cons but are way better than the status quo
> > > > > > > today.
> > > > > > > Please share your thoughts on the two proposals above, or maybe come up
> > > > > > > with better alternatives. We need to reach consensus on one proposal and
> > > > > > > move forward.
> > > > > > >
> > > > > > > [1] https://github.com/apache/parquet-format/pull/221
> > > > > > > [2] https://github.com/apache/arrow-rs/pull/7408
> > > > > > > [3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
> > > > > > > [4] https://github.com/apache/parquet-java/pull/3191
> > > > > > > [5] https://github.com/apache/parquet-format/pull/196
> > > > > > > [6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
> > > > > > >
> > > > > > > Best,
> > > > > > > Gang
> > > > > > >
> > > > > > > On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <[email protected]> wrote:
> > > > > > >
> > > > > > > > Dear contributors,
> > > > > > > >
> > > > > > > > My PR has now gathered comments for a week and the gist of all open issues
> > > > > > > > is the question of how to encode pages/column chunks that contain only
> > > > > > > > NaNs. There are different suggestions and I don't see one common favorite
> > > > > > > > yet.
> > > > > > > >
> > > > > > > > I have outlined three alternatives of how we can handle these and I want us
> > > > > > > > to reach a conclusion here, so I can update my PR accordingly and move on
> > > > > > > > with it. As this is my first contribution to Parquet, I don't know the
> > > > > > > > decision processes here. Do we vote? Is there a single decision maker or a
> > > > > > > > group of decision makers? *Please let me know how to come to a conclusion
> > > > > > > > here; what are the next steps?*
> > > > > > > >
> > > > > > > > For reference, here are the three alternatives I pointed out. You can find
> > > > > > > > a detailed description of their pros and cons in my comment:
> > > > > > > > https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
> > > > > > > >
> > > > > > > > 1. My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
> > > > > > > > 2. Adding `num_values` to the ColumnIndex, to make it symmetric with
> > > > > > > > Statistics in pages & `ColumnMetaData` and to enable the computation
> > > > > > > > `num_values - null_count - nan_count == 0`
> > > > > > > > 3. Adding a `nan_pages` bool list to the column index, which indicates
> > > > > > > > whether a page contains only NaNs
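For illustration, the computation from alternative 2 could be sketched in Python as follows. Both function names are hypothetical, and the extra `nan_count > 0` check is an assumption added here because the plain formula is also true for a page that holds only NULLs.

```python
def has_no_ordinary_values(num_values: int, null_count: int, nan_count: int) -> bool:
    """True if every value in the page is either NULL or NaN."""
    return num_values - null_count - nan_count == 0


def is_only_nan_page(num_values: int, null_count: int, nan_count: int) -> bool:
    """True if the page has at least one NaN and no ordinary values."""
    return nan_count > 0 and has_no_ordinary_values(num_values, null_count, nan_count)
```

For example, a page with 10 values, 4 NULLs and 6 NaNs has no value that min/max could describe, while a page with 10 values, 4 NULLs and 5 NaNs still has one.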
> > > > > > > >
> > > > > > > >
> > > > > > > > Cheers
> > > > > > > > Jan Finis
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
