Re: [DISCUSS](PARQUET-2249) Add nan_count to handle NaNs in statistics

Jan Finis Fri, 01 Aug 2025 03:18:07 -0700

Hi Gijs,

Thank you for bringing up concrete points; I'm happy to discuss them in
detail.


> NaNs are less common in the SQL world than in the DataFrame world where
> NaNs were used for a long time to represent missing values.


You could transcode between NULL and NaN when reading from and writing to
Parquet. You basically mention yourself that NaNs were used for missing
values, i.e., for what is commonly a NULL, which wasn't available. So,
semantically, transcoding to NULL would even be the sane thing to do. Yes,
that will cost you some cycles, but it should be a rather lightweight
operation in comparison to most other operations, so I would argue that it
won't ruin your performance. Besides, why should Parquet play
along with a "hack" that was done in other frameworks due to shortcomings
of those frameworks? So from a philosophical point of view, I think
supporting NaNs better is the wrong thing to do. Rather, we should be a
forcing function that aligns others to better behavior, so applying a bit of
force might in the long run make people use NULLs in DataFrames as well.

Of course, your argument also goes in the direction of pragmatism: if a
large part of the data science world uses NaNs to encode missing values,
then maybe Parquet should accept this de facto standard rather than
fighting it. That is indeed a valid point. Its weight is debatable,
and my personal conclusion is that it's still not worth it, as you can
transcode between NULLs and NaNs, but I do agree with its validity.
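
To make the transcoding concrete, here is a minimal sketch in plain Python.
The helper names `nan_to_null` and `null_to_nan` are made up for this
illustration and are not part of any Parquet library; a real writer would
do the same thing on its columnar buffers rather than on Python lists.

```python
import math
from typing import List, Optional


def nan_to_null(values: List[Optional[float]]) -> List[Optional[float]]:
    """Before writing: store every NaN as NULL (None)."""
    return [None if v is None or math.isnan(v) else v for v in values]


def null_to_nan(values: List[Optional[float]]) -> List[float]:
    """After reading: hand every NULL (None) back to the framework as NaN."""
    return [math.nan if v is None else v for v in values]


column = [1.0, math.nan, 2.5]
stored = nan_to_null(column)      # what would be written to the file
restored = null_to_nan(stored)    # what the DataFrame side would see again
```

Note that this round trip only works if the source data does not use NULL
and NaN with different meanings in the same column, since both end up as
NULL in the file.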


> Since the proposal phrases it as a goal to work "regardless of how they
> order NaN w.r.t. other values" this statement feels out-of-place to me.
> Most hardware and most people don't care about total ordering and needing
> to take it into account while filtering using statistics seems like
> preferring the special case instead of the common case. Almost no one
> filters for specific NaN value bit-patterns. SQL engines that don't have
> IEEE total ordering as their default ordering for floats will also need to
> do more special handling for this.


I disagree with the conclusion this statement draws. The current behavior,
as well as nan_counts without total ordering, poses a real problem here, even
for engines that don't care about bit patterns. I do agree that most database
engines, including the one I'm working on, do not care about bit patterns
and/or sign bits. However, how can our database engine know whether the
writer of a Parquet file saw it the same way? It can't. Therefore, it
cannot know whether a writer, for example, ordered NaNs before or after all
other numbers, or maybe ordered them by sign bit. So, if our database
engine now sees a float column in the sorting columns, it cannot apply any
optimization without a lot of special casing, as it doesn't know whether
NaNs will be before all other values, after all other values, or maybe
both, depending on sign bit. It could apply contrived logic that tries to
infer where NaNs were placed from the NaN counts of the first and last
page, but doing so would take a lot of ugly code that also feels like it is
in the wrong place. I don't want to have to load pages or the page index
just to reason about a sort order.
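
To illustrate the ambiguity, here is a small Python sketch of three NaN
placements a writer could plausibly have used when sorting a float column.
The key functions and their names are hypothetical and only serve this
example; none of them is prescribed by the Parquet format today.

```python
import math


def is_negative(x: float) -> bool:
    """True if the sign bit is set (also works for NaN)."""
    return math.copysign(1.0, x) < 0


def nans_first_key(x: float):
    # All NaNs before every number, regardless of sign bit.
    return (0, 0.0) if math.isnan(x) else (1, x)


def nans_last_key(x: float):
    # All NaNs after every number, regardless of sign bit.
    return (1, 0.0) if math.isnan(x) else (0, x)


def nans_by_sign_key(x: float):
    # Negative NaNs before every number, positive NaNs after.
    if math.isnan(x):
        return (-1, 0.0) if is_negative(x) else (1, 0.0)
    return (0, x)


def label(x: float) -> str:
    if math.isnan(x):
        return "-NaN" if is_negative(x) else "+NaN"
    return repr(x)


values = [2.0, math.nan, -1.0, math.copysign(math.nan, -1.0)]

placements = {
    name: [label(v) for v in sorted(values, key=key)]
    for name, key in [
        ("nans_first", nans_first_key),
        ("nans_last", nans_last_key),
        ("nans_by_sign", nans_by_sign_key),
    ]
}
```

All three writers would honestly claim that the column is sorted, yet the
three resulting sequences differ, and a reader cannot tell from the sorting
columns alone which one it is looking at.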

> SQL engines that don't have
> IEEE total ordering as their default ordering for floats will also need to
> do more special handling for this.


This code, which I would indeed need to write for our engine, is comparatively
trivial. Simply choose the largest possible bit pattern as the comparison value
for upper-bound filtering on NaN, and the smallest possible bit pattern for
lower bounds. It's no more than a few lines of code that check whether a
filter value is NaN and then replace it with the highest/lowest NaN bit
pattern. It is about as trivial as the special casing I need to do with
nan_counts, and it is far more trivial than the extra code I would need to
write for sorting columns, as described above.
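
As a rough sketch of what such a rewrite could look like for 64-bit floats,
in Python: the names `total_order_key` and `rewrite_nan_literal` are
hypothetical, and the sketch takes "smallest possible bit pattern" to mean
the lowest NaN under IEEE 754 total order, i.e. the one with the sign bit
set. That is the conservative reading; an engine may well choose
differently.

```python
import math
import struct

HIGHEST_NAN_BITS = 0x7FFFFFFFFFFFFFFF  # sign bit clear, all other bits set
LOWEST_NAN_BITS = 0xFFFFFFFFFFFFFFFF   # sign bit set, all other bits set


def total_order_key(x: float) -> int:
    """Integer key whose natural order matches IEEE 754 totalOrder."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))  # signed 64-bit view
    # For negative values, flip all bits except the sign bit.
    mask = (bits >> 63) & 0x7FFFFFFFFFFFFFFF
    return bits ^ mask


def from_bits(bits: int) -> float:
    """Reinterpret an unsigned 64-bit pattern as a float."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def rewrite_nan_literal(value: float, used_as_upper_bound: bool) -> float:
    """Replace a NaN filter literal by an extreme NaN bit pattern.

    Non-NaN literals are returned unchanged.
    """
    if not math.isnan(value):
        return value
    bits = HIGHEST_NAN_BITS if used_as_upper_bound else LOWEST_NAN_BITS
    return from_bits(bits)


upper = rewrite_nan_literal(math.nan, used_as_upper_bound=True)
lower = rewrite_nan_literal(math.nan, used_as_upper_bound=False)
samples = [-math.inf, -1.0, -0.0, 0.0, 1.0, math.inf,
           math.nan, math.copysign(math.nan, -1.0)]
```

With this rewrite, a NaN used as an upper bound compares greater than or
equal to every other value under total order, and a NaN used as a lower
bound compares less than or equal to every other value, so a min/max
comparison against the rewritten literal never excludes a page just
because of the particular NaN payload the user happened to pass in.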

> From a Polars perspective, having a `nan_count` and defining what
> happens to the `min` and `max` statistics when a page contains only NaNs is
> enough to allow for all predicate filtering. I think, but correct me if I
> am wrong, this is also enough for all SQL engines that don't use total
> ordering.


It's not fully enough, as described above. Sorting columns would still not
work properly.

> As for ways forward, I propose merging the `nan_count` and `sort ordering`
> proposals into one to make one proposal


Note that the initial reason for proposing IEEE total order was that people
in the discussion threads found nan_counts to be too complex and too much
of an undeserving special case (re-read the discussion in the initial PR
<https://github.com/apache/parquet-format/pull/196> to see the rationales).
So merging both together would go totally against the spirit of why IEEE
total order was proposed. While it has further upsides, the main reason was
indeed to *not have* nan_counts. If the proposal now even moved to
positive and negative NaN counts (i.e., even more complexity), it would
go 180 degrees against the reason people wanted total order
in the first place.

Cheers,
Jan

On Thu, Jul 31, 2025 at 11:23 PM Gijs Burghoorn
<g...@polars.tech.invalid> wrote:

> Hello Jan and others,
>
> First, let me preface by saying I am quite new here. So I apologize if
> there is some other better way to bring up these concerns. I understand it
> is very annoying to come in at the 11th hour and start bringing up a bunch
> of concerns, but I would also like this to be done right. A colleague of
> mine brought up some concerns and alternative approaches in the GitHub
> thread; I will file some of the concerns here as a response.
>
> > Treating NaNs so specially is giving them attention they don't deserve.
> Most data sets do not contain NaNs. If a use case really requires them and
> needs filtering to ignore them, they can store NULL instead, or encode them
> differently. I would prefer the average case over the special case here.
>
> NaNs are less common in the SQL world than in the DataFrame world where
> NaNs were used for a long time to represent missing values. They still
> exist with different canonical representations and different sign bits. I
> agree it might not be correct semantically, but sadly that is the world we
> deal with. NumPy and Numba do not have missing data functionality, people
> use NaNs there, and people definitely use that in their analytical
> dataflows. Another point that was brought up in the GH discussion was "what
> about infinity? You could argue that having infinity in statistics is
> similarly unuseful as it's too wide of a bound". I would argue that
> infinity is very different as there is no discussion on what the ordering
> or pattern of infinity is. Everyone agrees that `min(1.0, inf, -inf) ==
> -inf` and each infinity only has a single bit pattern.
>
> > It gives a defined order to every bit pattern and thus yields a total
> order, mathematically speaking, which has value by itself. With NaN counts,
> it was still undefined how different bit patterns of NaNs were supposed to
> be ordered, whether NaN was allowed to have a sign bit, etc., risking that
> different engines could come to different results while filtering or
> sorting values within a file.
>
> Since the proposal phrases it as a goal to work "regardless of how they
> order NaN w.r.t. other values" this statement feels out-of-place to me.
> Most hardware and most people don't care about total ordering and needing
> to take it into account while filtering using statistics seems like
> preferring the special case instead of the common case. Almost no one
> filters for specific NaN value bit-patterns. SQL engines that don't have
> IEEE total ordering as their default ordering for floats will also need to
> do more special handling for this.
>
> I also agree with my colleague that doing an approach that is 50% of the
> way there will make the barrier to improving it to what it actually should
> be later on much higher.
>
> As for ways forward, I propose merging the `nan_count` and `sort ordering`
> proposals into one to make one proposal, as they are linked together, and
> moving forward with one without knowing what will happen to the other seems
> unwise. From a Polars perspective, having a `nan_count` and defining what
> happens to the `min` and `max` statistics when a page contains only NaNs is
> enough to allow for all predicate filtering. I think, but correct me if I
> am wrong, this is also enough for all SQL engines that don't use total
> ordering. But if you want to be impartial to the engine's floating-point
> ordering and allow engines with total ordering to do inequality filters
> when `nan_count > 0` you would need a `positive_nan_count` and a
> `negative_nan_count`. I understand the downside with Thrift complexity, but
> introducing another sort order is also adding complexity just in a
> different place.
>
> I would really like to see this move forward, so I hope these concerns help
> move it forward towards a solution that works for everyone.
>
> Kind regards,
> Gijs
>
>
> On Thu, Jul 31, 2025 at 7:46 PM Andrew Lamb <andrewlam...@gmail.com>
> wrote:
>
> > I would also be in favor of starting a vote
> >
> > On Thu, Jul 31, 2025 at 11:23 AM Jan Finis <jpfi...@gmail.com> wrote:
> >
> > > As the author of both the IEEE754 total order
> > > <https://github.com/apache/parquet-format/pull/221> PR and the earlier PR
> > > that basically proposed `nan_count`
> > > <https://github.com/apache/parquet-format/pull/196>, my current vote would
> > > be for IEEE754 total order.
> > > Consequently, I would like to request a formal vote for the PR introducing
> > > IEEE754 total order (https://github.com/apache/parquet-format/pull/221), if
> > > that is possible.
> > >
> > > My Rationales:
> > >
> > >    - It's conceptually simpler. It's easier to explain. It's based on an
> > >    IEEE-standardized order predicate.
> > >    - There are already multiple implementations showing feasibility. This
> > >    will likely make the adoption quicker.
> > >    - It gives a defined order to every bit pattern and thus yields a total
> > >    order, mathematically speaking, which has value by itself. With NaN counts,
> > >    it was still undefined how different bit patterns of NaNs were supposed to
> > >    be ordered, whether NaN was allowed to have a sign bit, etc., risking that
> > >    different engines could come to different results while filtering or
> > >    sorting values within a file.
> > >    - It also solves sort order completely. With nan_counts only, it is
> > >    still undefined whether nans should be sorted before or after all values
> > >    (or both, depending on sign bit), so any file including NaNs could not
> > >    really leverage sort order without being ambiguous.
> > >    - It's less complex in thrift. Having fields that only apply to a
> > >    handful of data types is somehow weird. If every type did this, we would
> > >    have a plethora of non-generic fields in thrift.
> > >    - Treating NaNs so specially is giving them attention they don't
> > >    deserve. Most data sets do not contain NaNs. If a use case really requires
> > >    them and needs filtering to ignore them, they can store NULL instead,
> > >    or encode them differently. I would prefer the average case over the
> > >    special case here.
> > >    - The majority of the people discussing this so far seem to favor total
> > >    order.
> > >
> > > Cheers,
> > > Jan
> > >
> > > On Sat, Jul 26, 2025 at 5:38 PM Gang Wu <ust...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > As this discussion has been open for more than two years, I’d like to bump
> > > > up this thread again to update the progress and collect feedback.
> > > >
> > > > *Background*
> > > > • Today Parquet’s min/max stats and page index omit NaNs entirely.
> > > > • Engines can’t safely prune floating values because they know nothing on
> > > > NaNs.
> > > > • Column index is disabled if any page contains only NaNs.
> > > >
> > > > There are two active proposals as below:
> > > >
> > > > *Proposal A - IEEE754TotalOrder* (from the PR [1])
> > > > • Define a new ColumnOrder to include +0, –0 and all NaN bit‐patterns.
> > > > • Stats and column index store NaNs if they appear.
> > > > • Three PoC impls are ready: arrow-rs [2], duckdb [3] and parquet-java [4].
> > > > • For more context of this approach, please refer to discussion in [5].
> > > >
> > > > *Proposal B - add nan_count* (from a comment [6] to [1])
> > > > • Add `nan_count` to stats and a `nan_counts` list to column index.
> > > > • For all‐NaNs cases, write NaN to min/max and use nan_count to
> > > > distinguish.
> > > >
> > > > Both solutions have pros and cons but are way better than the status quo
> > > > today.
> > > > Please share your thoughts on the two proposals above, or maybe come up
> > > > with better alternatives. We need consensus on one proposal and move
> > > > forward.
> > > >
> > > > [1] https://github.com/apache/parquet-format/pull/221
> > > > [2] https://github.com/apache/arrow-rs/pull/7408
> > > > [3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
> > > > [4] https://github.com/apache/parquet-java/pull/3191
> > > > [5] https://github.com/apache/parquet-format/pull/196
> > > > [6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
> > > >
> > > > Best,
> > > > Gang
> > > >
> > > > On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <jpfi...@gmail.com> wrote:
> > > >
> > > > > Dear contributors,
> > > > >
> > > > > My PR has now gathered comments for a week and the gist of all open issues
> > > > > is the question of how to encode pages/column chunks that contain only
> > > > > NaNs. There are different suggestions and I don't see one common favorite
> > > > > yet.
> > > > >
> > > > > I have outlined three alternatives of how we can handle these and I want us
> > > > > to reach a conclusion here, so I can update my PR accordingly and move on
> > > > > with it. As this is my first contribution to parquet, I don't know the
> > > > > decision processes here. Do we vote? Is there a single or group of decision
> > > > > makers? *Please let me know how to come to a conclusion here; what are the
> > > > > next steps?*
> > > > >
> > > > > For reference, here are the three alternatives I pointed out. You can find
> > > > > detailed description of their PROs and CONs in my comment:
> > > > > https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
> > > > >
> > > > > 1. My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
> > > > > 2. Adding `num_values` to the ColumnIndex, to make it symmetric with
> > > > > Statistics in pages & `ColumnMetaData` and to enable the computation
> > > > > `num_values - null_count - nan_count == 0`
> > > > > 3. Adding a `nan_pages` bool list to the column index, which indicates
> > > > > whether a page contains only NaNs
> > > > >
> > > > >
> > > > > Cheers
> > > > > Jan Finis
> > > > >
> > > >
> > >
> >
>
