Thank you Jan -- I read through the new combined proposal, and I think it
looks good and addresses the feedback so far. I left some small style
suggestions, but nothing that is required from my perspective.



On Sat, Aug 9, 2025 at 9:07 AM Jan Finis <jpfi...@gmail.com> wrote:

> Hey Ryan,
>
> Thanks for chiming in. First of all, to make it quick: Yes, the solution of
> having nan_counts *and* total order, which was brought up multiple times,
> does work and solves more cases than either one alone.
>
> I strongly prefer continuing to discuss the merits of these approaches
> > rather than trying to decide with a vote.
>
>
> In theory, I agree that it isn't good to silence a discussion by just
> voting for one possible solution and technical issues should be discussed.
> However, please note that we have been circling on this for over two years
> now, including an extended discussion that brought up all arguments
> multiple times. This is in stark contrast to the
> speed with which you guys work on the Iceberg spec, for example. There, you
> also do not discuss the merits of various solutions for multiple years. You
> just pick one and merge it after a *reasonable* time of discussion.
> If you had the speed we currently have here, nothing would get done. Thus,
> I see this as a clear case of *"the perfect is the enemy of the good"*.
> Yes, we can continue looking for the perfect solution,
> but that will likely lead to keeping us at the status quo, which is the
> worst of them all.
>
> That being said, I'm also happy to create a PR which does both total order
> and NaN counts; after all, I just want the issue solved and all these
> solutions are better than the status quo.
>
> *As this has now been suggested by at least three people, I guess it's worth
> doing, so here you go: https://github.com/apache/parquet-format/pull/514*
>
> With this, we should have PRs covering most of the solution space.
> (I'm refusing to create a PR with negative and positive nan_counts;
> nan_counts + total order has to suffice; the complexity madness has to stop
> somewhere)
> I still believe that there were a number of people who already found
> nan_counts too complex and therefore wanted IEEE total order, and these
> people may not like the extra complexity,
> but let's see, maybe some have also changed their opinion in the meantime.
>
>
> *Given all this, we can also first do an informal vote where everyone
> names their favorite of the three. Maybe a clear favorite
> will emerge and then we can vote on this one.*
>
> But of course, we can also take some weeks to discuss the three solutions,
> now that we have PRs for all of them. I just hope this won't make us
> continue for another 2 years, or an
> infinite stalemate where each solution is vetoed by a PMC member.
> (Sorry for becoming a bit cynical here; I have just spent way too much time
> of my life with double statistics at this point ;) ...)
>
>
> Cheers,
> Jan
>
> On Fri, Aug 8, 2025 at 11:38 PM Ryan Blue <rdb...@gmail.com> wrote:
>
> > Regarding the process for this, I strongly prefer continuing to discuss
> the
> > merits of these approaches rather than trying to decide with a vote. I
> > don't think it is a good practice to use a vote to decide on a technical
> > direction. There are very few situations that warrant it and I don't
> think
> > that this is one of them. While this issue has been open for a long time,
> > that appears to be the result of it not being anyone's top priority
> rather
> > than indecision.
> >
> > For the technical merits of these approaches, I think that we can find a
> > middle ground. I agree with Jan that when working with sorted values, we
> > need to know how NaN values were handled and that requires using a
> > well-defined order that includes NaN and its variations (because we
> should
> > not normalize). Using NaN count is not sufficient for ordering rows.
> >
> > Gijs also brings up good points about how NaN values show up in actual
> > datasets: not just when used in place of null, but also as the result of
> > normal calculations on abnormal data, like `sqrt(-4.0)` or `log(-1.0)`.
> > Both of those present problems when mixed with valid data because of the
> > stats "poisoning" problem, where the range of valid data is usable until
> a
> > single NaN is mixed in.
> >
> > Another issue is that NaN is error-prone because "regular" comparison is
> > always false:
> > ```
> > Math.log(-1.0) >= 2 => FALSE
> > Math.log(-1.0) < 2 => FALSE
> > 2 > Math.log(-1.0) => FALSE
> > ```
> >
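A minimal Python sketch of the comparison pitfall above. `float("nan")` stands in for `Math.log(-1.0)` (Python's own `math.log(-1.0)` raises `ValueError` rather than returning NaN), and the two `min` calls are an added illustration, not part of the original message, of why a NaN in a naively computed bound is ambiguous:

```python
import math

# Stand-in for the NaN that Math.log(-1.0) produces in the snippet above.
nan = float("nan")

# Every ordered comparison involving NaN evaluates to False, as does equality.
comparisons = [nan >= 2, nan < 2, 2 > nan, nan == nan]

# A min built on "<" therefore depends on where the NaN happens to appear.
min_nan_first = min([nan, 1.0, 2.0])  # NaN is never displaced by a later value
min_nan_last = min([1.0, 2.0, nan])   # NaN never displaces the earlier 1.0

print(comparisons, min_nan_first, min_nan_last)
```

The same input values yield a NaN minimum in one order and 1.0 in the other, so a reader of the stats cannot tell which convention the writer used.
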
> > As a result, Iceberg doesn't trust NaN values as either lower or upper
> > bounds because we don't want to go back to the code that produced the
> value
> > to see what the comparison order was to determine whether NaN values go
> > before or after others.
> >
> > Total order solves the second issue in theory, but regular comparison is
> > prevalent and not obvious to developers. And it also doesn't help when
> NaN
> > is used instead of null. So using total order is not sufficient for data
> > skipping.
> >
> > I think the right compromise is to use `min`, `max`, and `nan_count` for
> > data skipping stats (where min and max cannot be NaN) and total ordering
> > for sorting values. That satisfies the data skipping use cases and also
> > gives us an ordering of unaltered values that we can reason about.
> >
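A rough sketch of how a reader might use such stats for data skipping, assuming the compromise above (min and max never hold NaN, NaNs are only counted). `PageStats`, its field names, and both helper functions are hypothetical and do not come from the Parquet spec or any of the PRs:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageStats:
    """Hypothetical per-page statistics; min_value/max_value never hold NaN."""
    num_values: int
    null_count: int
    nan_count: int
    min_value: Optional[float]  # None when the page has no regular value
    max_value: Optional[float]


def may_contain_greater_than(stats: PageStats, threshold: float) -> bool:
    """False only if no row can satisfy `x > threshold`.

    Assumes regular comparison semantics, where NaN never matches.
    """
    if stats.max_value is None:
        return False  # page holds only NULLs and/or NaNs
    return stats.max_value > threshold


def may_contain_nan(stats: PageStats) -> bool:
    """An `is NaN` predicate can be answered from the count alone."""
    return stats.nan_count > 0


mixed = PageStats(100, 0, 3, 1.0, 9.5)
only_nan = PageStats(10, 0, 10, None, None)
print(may_contain_greater_than(mixed, 5.0), may_contain_nan(mixed))
```

Here a single NaN no longer "poisons" the range: the bounds stay usable, and the count answers NaN-specific predicates separately.
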
> > Does anyone think that doesn't work?
> >
> > Ryan
> >
> > On Fri, Aug 1, 2025 at 8:57 AM Gang Wu <ust...@gmail.com> wrote:
> >
> > > Thanks Jan for your endless effort on this!
> > >
> > > I'm in favor of simplicity and generality. I think we have already debated
> > > `nan_count` at length in [1], and [2] reflects those discussions.
> > > Therefore I am inclined to start a vote for [2] unless there is a
> > > significantly better proposal.
> > >
> > > I suggest that everyone interested in this discussion attend the scheduled
> > > sync on Aug 6th (details below) to spread the word to the broader community.
> > > If we can reach consensus on [2], I can help start the vote and move forward.
> > >
> > > *Apache Parquet Community Sync Wednesday, August 6 · 10:00 – 11:00am *
> > > *Time zone: America/Los_Angeles*
> > > *Google Meet joining info Video call link:
> > > https://meet.google.com/bhe-rvan-qjk
> > > <https://meet.google.com/bhe-rvan-qjk> *
> > >
> > > [1] https://github.com/apache/parquet-format/pull/196
> > > [2] https://github.com/apache/parquet-format/pull/221
> > >
> > > Best,
> > > Gang
> > >
> > >
> > > On Fri, Aug 1, 2025 at 6:16 PM Jan Finis <jpfi...@gmail.com> wrote:
> > >
> > > > Hi Gijs,
> > > >
> > > > Thank you for bringing up concrete points, I'm happy to discuss them
> in
> > > > detail.
> > > >
> > > > NaNs are less common in the SQL world than in the DataFrame world
> where
> > > > > NaNs were used for a long time to represent missing values.
> > > >
> > > >
> > > > > You could transcode between NULL and NaN before reading and writing to
> > > > Parquet. You basically mention yourself that NaNs were used for
> missing
> > > > values, i.e., what is commonly a NULL, which wasn't available. So,
> > > > semantically, transcoding to NULL would even be the sane thing to do.
> > > Yes,
> > > > that will cost you some cycles, but should be a rather lightweight
> > > > operation in comparison to most other operations, so I would argue
> that
> > > it
> > > > won't totally ruin your performance. Similarly, why should Parquet
> play
> > > > along with a "hack" that was done in other frameworks due to
> > shortcomings
> > > > of those frameworks? So from a philosophical point of view, I think
> > > > > supporting NaNs better is the wrong thing to do. Rather, we should be a
> > > > > forcing function to align others to better behavior, so applying a bit of
> > > > > force might in the long run make people use NULLs also in DataFrames.
> > > >
> > > > Of course, your argument also goes into the direction of pragmatism:
> > If a
> > > > large part of the data science world uses NaNs to encode missing
> > values,
> > > > then maybe Parquet should accept this de-facto standard rather than
> > > > fighting it. That is indeed a valid point. The weight of it is
> > debatable
> > > > and my personal conclusion is that it's still not worth it, as you
> can
> > > > transcode between NULLs and NaNs, but I do agree with its validity.
> > > >
> > > >
> > > > Since the proposal phrases it as a goal to work "regardless of how
> they
> > > > > order NaN w.r.t. other values" this statement feels out-of-place to
> > me.
> > > > > Most hardware and most people don't care about total ordering and
> > > needing
> > > > > to take it into account while filtering using statistics seems like
> > > > > > preferring the special case instead of the common case. Almost no one
> > > > > > filters for specific NaN value bit-patterns. SQL engines that don't
> > > have
> > > > > IEEE total ordering as their default ordering for floats will also
> > need
> > > > to
> > > > > do more special handling for this.
> > > >
> > > >
> > > > I disagree with the conclusion this statement draws. The current
> > > behavior,
> > > > and nan_counts without total ordering, pose a real problem here, even
> > for
> > > > engines that don't care about bit patterns. I do agree that most
> > database
> > > > engines, including the one I'm working on, do not care about bit
> > patterns
> > > > and/or sign bits. However, how can our database engine know whether
> the
> > > > writer of a Parquet file saw it the same way? It can't. Therefore, it
> > > > cannot know whether a writer, for example, ordered NaNs before or
> after
> > > all
> > > > other numbers, or maybe ordered them by sign bit. So, if our database
> > > > engine now sees a float column in sorting columns, it cannot apply
> any
> > > > optimization without a lot of special casing, as it doesn't know
> > whether
> > > > NaNs will be before all other values, after all other values, or
> maybe
> > > > both, depending on sign bit. It could apply contrived logic that
> tries
> > to
> > > > infer where NaNs were placed from the NaN counts of the first and
> last
> > > > page, but doing so will be a lot of ugly code that also feels to be
> in
> > > the
> > > > wrong place. I.e., I don't want to need to load pages or the page
> > index,
> > > > just to reason about a sort order.
> > > >
> > > > SQL engines that don't have
> > > > > IEEE total ordering as their default ordering for floats will also
> > need
> > > > to
> > > > > do more special handling for this.
> > > >
> > > >
> > > > > This code, which I would indeed need to write for our engine, is
> > > > > comparatively trivial. Simply choose the largest possible NaN bit pattern
> > > > > as the comparison value for upper-bound filtering, and the smallest
> > > > > possible NaN bit pattern for lower bounds. It's not more than a few lines
> > > > > of code that check whether a filter value is NaN and then replace it with
> > > > > the highest/lowest NaN bit pattern. It is about as trivial as the special
> > > > > casing I need to do with nan_counts, and far more trivial than the extra
> > > > > code I would need to write for sorting columns, as described above.
> > > >
> > > > From a Polars perspective, having a `nan_count` and defining what
> > > > > happens to the `min` and `max` statistics when a page contains only
> > > NaNs
> > > > is
> > > > > enough to allow for all predicate filtering. I think, but correct
> me
> > > if I
> > > > > am wrong, this is also enough for all SQL engines that don't use
> > total
> > > > > ordering.
> > > >
> > > >
> > > > It's not fully enough, as depicted above. Sorting columns would still
> > not
> > > > work properly.
> > > >
> > > > As for ways forward, I propose merging the `nan_count` and `sort
> > > ordering`
> > > > > proposals into one to make one proposal
> > > >
> > > >
> > > > Note that the initial reason for proposing IEEE total order was that
> > > people
> > > > in the discussion threads found nan_counts to be too complex and too
> > much
> > > > of an undeserving special case (re-read the discussion in the initial
> > PR
> > > > <https://github.com/apache/parquet-format/pull/196> to see the
> > > > rationales).
> > > > So merging both together would go totally against the spirit of why
> > IEEE
> > > > total order was proposed. While it has further upsides, the main
> reason
> > > was
> > > > indeed to *not have* nan_counts. If now the proposal would even go to
> > > > positive and negative nan counts (i.e., even more complexity), this
> > would
> > > > go 180 degrees into the opposite direction of why people wanted total
> > > order
> > > > in the first place.
> > > >
> > > > Cheers,
> > > > Jan
> > > >
> > > > On Thu, Jul 31, 2025 at 11:23 PM Gijs Burghoorn
> > > > <g...@polars.tech.invalid> wrote:
> > > >
> > > > > Hello Jan and others,
> > > > >
> > > > > First, let me preface by saying I am quite new here. So I apologize
> > if
> > > > > there is some other better way to bring up these concerns. I
> > understand
> > > > it
> > > > > is very annoying to come in at the 11th hour and start bringing up
> a
> > > > bunch
> > > > > of concerns, but I would also like this to be done right. A
> colleague
> > > of
> > > > > mine brought up some concerns and alternative approaches in the
> > GitHub
> > > > > thread; I will file some of the concerns here as a response.
> > > > >
> > > > > > Treating NaNs so specially is giving them attention they don't
> > > deserve.
> > > > > Most data sets do not contain NaNs. If a use case really requires
> > them
> > > > and
> > > > > needs filtering to ignore them, they can store NULL instead, or
> > encode
> > > > them
> > > > > differently. I would prefer the average case over the special case
> > > here.
> > > > >
> > > > > NaNs are less common in the SQL world than in the DataFrame world
> > where
> > > > > NaNs were used for a long time to represent missing values. They
> > still
> > > > > exist with different canonical representations and different sign
> > > bits. I
> > > > > agree it might not be correct semantically, but sadly that is the
> > world
> > > > we
> > > > > deal with. NumPy and Numba do not have missing data functionality,
> > > people
> > > > > use NaNs there, and people definitely use that in their analytical
> > > > > dataflows. Another point that was brought up in the GH discussion
> was
> > > > "what
> > > > > about infinity? You could argue that having infinity in statistics
> is
> > > > > similarly unuseful as it's too wide of a bound". I would argue that
> > > > > infinity is very different as there is no discussion on what the
> > > ordering
> > > > > or pattern of infinity is. Everyone agrees that `min(1.0, inf,
> -inf)
> > ==
> > > > > -inf` and each infinity only has a single bit pattern.
> > > > >
> > > > > > It gives a defined order to every bit pattern and thus yields a
> > total
> > > > > order, mathematically speaking, which has value by itself. With NaN
> > > > counts,
> > > > > it was still undefined how different bit patterns of NaNs were
> > supposed
> > > > to
> > > > > be ordered, whether NaN was allowed to have a sign bit, etc.,
> risking
> > > > that
> > > > > different engines could come to different results while filtering
> or
> > > > > sorting values within a file.
> > > > >
> > > > > Since the proposal phrases it as a goal to work "regardless of how
> > they
> > > > > order NaN w.r.t. other values" this statement feels out-of-place to
> > me.
> > > > > Most hardware and most people don't care about total ordering and
> > > needing
> > > > > to take it into account while filtering using statistics seems like
> > > > > > preferring the special case instead of the common case. Almost no one
> > > > > > filters for specific NaN value bit-patterns. SQL engines that don't
> > > have
> > > > > IEEE total ordering as their default ordering for floats will also
> > need
> > > > to
> > > > > do more special handling for this.
> > > > >
> > > > > I also agree with my colleague that doing an approach that is 50%
> of
> > > the
> > > > > way there will make the barrier to improving it to what it actually
> > > > should
> > > > > be later on much higher.
> > > > >
> > > > > As for ways forward, I propose merging the `nan_count` and `sort
> > > > ordering`
> > > > > proposals into one to make one proposal, as they are linked
> together,
> > > and
> > > > > moving forward with one without knowing what will happen to the
> other
> > > > seems
> > > > > unwise. From a Polars perspective, having a `nan_count` and
> defining
> > > what
> > > > > happens to the `min` and `max` statistics when a page contains only
> > > NaNs
> > > > is
> > > > > enough to allow for all predicate filtering. I think, but correct
> me
> > > if I
> > > > > am wrong, this is also enough for all SQL engines that don't use
> > total
> > > > > ordering. But if you want to be impartial to the engine's
> > > floating-point
> > > > > ordering and allow engines with total ordering to do inequality
> > filters
> > > > > when `nan_count > 0` you would need a `positive_nan_count` and a
> > > > > `negative_nan_count`. I understand the downside with Thrift
> > complexity,
> > > > but
> > > > > introducing another sort order is also adding complexity just in a
> > > > > different place.
> > > > >
> > > > > I would really like to see this move forward, so I hope these
> > concerns
> > > > help
> > > > > move it forward towards a solution that works for everyone.
> > > > >
> > > > > Kind regards,
> > > > > Gijs
> > > > >
> > > > >
> > > > > On Thu, Jul 31, 2025 at 7:46 PM Andrew Lamb <
> andrewlam...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I would also be in favor of starting a vote
> > > > > >
> > > > > > On Thu, Jul 31, 2025 at 11:23 AM Jan Finis <jpfi...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > As the author of both the IEEE754 total order
> > > > > > > <https://github.com/apache/parquet-format/pull/221> PR and the
> > > > earlier
> > > > > > PR
> > > > > > > that basically proposed `nan_count`
> > > > > > > <https://github.com/apache/parquet-format/pull/196>, my
> current
> > > vote
> > > > > > would
> > > > > > > be for IEEE754 total order.
> > > > > > > Consequently, I would like to request a formal vote for the PR
> > > > > > introducing
> > > > > > > IEEE754 total order (
> > > > https://github.com/apache/parquet-format/pull/221
> > > > > ),
> > > > > > > if
> > > > > > > that is possible.
> > > > > > >
> > > > > > > My Rationales:
> > > > > > >
> > > > > > >    - It's conceptually simpler. It's easier to explain. It's
> > based
> > > on
> > > > > an
> > > > > > >    IEEE-standardized order predicate.
> > > > > > >    - There are already multiple implementations showing
> > > feasibility.
> > > > > This
> > > > > > >    will likely make the adoption quicker.
> > > > > > >    - It gives a defined order to every bit pattern and thus
> > yields
> > > a
> > > > > > total
> > > > > > >    order, mathematically speaking, which has value by itself.
> > With
> > > > NaN
> > > > > > > counts,
> > > > > > >    it was still undefined how different bit patterns of NaNs
> were
> > > > > > supposed
> > > > > > > to
> > > > > > >    be ordered, whether NaN was allowed to have a sign bit,
> etc.,
> > > > > risking
> > > > > > > that
> > > > > > >    different engines could come to different results while
> > > filtering
> > > > or
> > > > > > >    sorting values within a file.
> > > > > > >    - It also solves sort order completely. With nan_counts
> only,
> > it
> > > > is
> > > > > > >    still undefined whether nans should be sorted before or
> after
> > > all
> > > > > > values
> > > > > > >    (or both, depending on sign bit), so any file including NaNs
> > > could
> > > > > not
> > > > > > >    really leverage sort order without being ambiguous.
> > > > > > >    - It's less complex in thrift. Having fields that only apply
> > to
> > > a
> > > > > > >    handful of data types is somehow weird. If every type did
> > this,
> > > we
> > > > > > would
> > > > > > >    have a plethora of non-generic fields in thrift.
> > > > > > >    - Treating NaNs so specially is giving them attention they
> > don't
> > > > > > >    deserve. Most data sets do not contain NaNs. If a use case
> > > really
> > > > > > > requires
> > > > > > >    them and needs filtering to ignore them, they can store NULL
> > > > > instead,
> > > > > > >    or encode them differently. I would prefer the average case
> > over
> > > > the
> > > > > > >    special case here.
> > > > > > >    - The majority of the people discussing this so far seem to
> > > favor
> > > > > > total
> > > > > > >    order.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Jan
> > > > > > >
> > > > > > > > On Sat, Jul 26, 2025 at 5:38 PM Gang Wu <ust...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > As this discussion has been open for more than two years, I’d
> > > like
> > > > to
> > > > > > > bump
> > > > > > > > up
> > > > > > > > this thread again to update the progress and collect
> feedback.
> > > > > > > >
> > > > > > > > *Background*
> > > > > > > > • Today Parquet’s min/max stats and page index omit NaNs
> > > entirely.
> > > > > > > > > • Engines can’t safely prune floating-point values because they know
> > > > > > > > > nothing about NaNs.
> > > > > > > > • Column index is disabled if any page contains only NaNs.
> > > > > > > >
> > > > > > > > There are two active proposals as below:
> > > > > > > >
> > > > > > > > *Proposal A - IEEE754TotalOrder* (from the PR [1])
> > > > > > > > • Define a new ColumnOrder to include +0, –0 and all NaN
> > > > > bit‐patterns.
> > > > > > > > • Stats and column index store NaNs if they appear.
> > > > > > > > • Three PoC impls are ready: arrow-rs [2], duckdb [3] and
> > > > > parquet-java
> > > > > > > [4].
> > > > > > > > • For more context of this approach, please refer to
> discussion
> > > in
> > > > > [5].
> > > > > > > >
> > > > > > > > *Proposal B - add nan_count* (from a comment [6] to [1])
> > > > > > > > • Add `nan_count` to stats and a `nan_counts` list to column
> > > index.
> > > > > > > > • For all‐NaNs cases, write NaN to min/max and use nan_count
> to
> > > > > > > > distinguish.
> > > > > > > >
> > > > > > > > Both solutions have pros and cons but are way better than the
> > > > status
> > > > > > quo
> > > > > > > > today.
> > > > > > > > > Please share your thoughts on the two proposals above, or maybe come up
> > > > > > > > > with better alternatives. We need to reach consensus on one proposal and
> > > > > > > > > move forward.
> > > > > > > >
> > > > > > > > [1] https://github.com/apache/parquet-format/pull/221
> > > > > > > > [2] https://github.com/apache/arrow-rs/pull/7408
> > > > > > > > [3]
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
> > > > > > > > [4] https://github.com/apache/parquet-java/pull/3191
> > > > > > > > [5] https://github.com/apache/parquet-format/pull/196
> > > > > > > > [6]
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Gang
> > > > > > > >
> > > > > > > > On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <jpfi...@gmail.com
> >
> > > > wrote:
> > > > > > > >
> > > > > > > > > Dear contributors,
> > > > > > > > >
> > > > > > > > > My PR has now gathered comments for a week and the gist of
> > all
> > > > open
> > > > > > > > issues
> > > > > > > > > is the question of how to encode pages/column chunks that
> > > contain
> > > > > > only
> > > > > > > > > NaNs. There are different suggestions and I don't see one
> > > common
> > > > > > > favorite
> > > > > > > > > yet.
> > > > > > > > >
> > > > > > > > > I have outlined three alternatives of how we can handle
> these
> > > > and I
> > > > > > > want
> > > > > > > > us
> > > > > > > > > to reach a conclusion here, so I can update my PR
> accordingly
> > > and
> > > > > > move
> > > > > > > on
> > > > > > > > > with it. As this is my first contribution to parquet, I
> don't
> > > > know
> > > > > > the
> > > > > > > > > decision processes here. Do we vote? Is there a single or
> > group
> > > > of
> > > > > > > > decision
> > > > > > > > > makers? *Please let me know how to come to a conclusion
> here;
> > > > what
> > > > > > are
> > > > > > > > the
> > > > > > > > > next steps?*
> > > > > > > > >
> > > > > > > > > For reference, here are the three alternatives I pointed
> out.
> > > You
> > > > > can
> > > > > > > > find
> > > > > > > > > detailed description of their PROs and CONs in my comment:
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
> > > > > > > > >
> > > > > > > > > 1. My initial proposal, i.e., encoding only-NaN pages by
> > > > > min=max=NaN.
> > > > > > > > > 2. Adding `num_values` to the ColumnIndex, to make it
> > symmetric
> > > > > with
> > > > > > > > > Statistics in pages & `ColumnMetaData` and to enable the
> > > > > computation
> > > > > > > > > `num_values - null_count - nan_count == 0`
> > > > > > > > > 3. Adding a `nan_pages` bool list to the column index,
> which
> > > > > > indicates
> > > > > > > > > whether a page contains only NaNs
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Cheers
> > > > > > > > > Jan Finis
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
