Re: [DISCUSS](PARQUET-2249) Add nan_count to handle NaNs in statistics

Micah Kornfield Tue, 12 May 2026 14:01:35 -0700

I left some minor comments on the spec, but overall it looks good to me.

I have concerns about the Rust implementation, which seems to hard-code
writing out the new sort order.  It would be good to understand whether this
would break old readers (i.e., how do they handle an unknown sort order at the
Thrift parsing stage?).  I would be less concerned if, as in Java, this were
an opt-in feature for now.


Thanks,
Micah

On Mon, May 11, 2026 at 10:56 PM Gang Wu <[email protected]> wrote:

> Just bumping this one last time.
>
> No objections have been raised since the last round of reviews on the
> spec PR and PoC implementations.
>
> If this is still the case, I'll start a formal vote for the spec PR
> https://github.com/apache/parquet-format/pull/514 early next week.
>
> Best,
> Gang
>
> On Fri, Apr 24, 2026 at 11:49 PM Gang Wu <[email protected]> wrote:
> >
> > Update on the progress of PARQUET-2249.
> >
> > We now have two complete PoC implementations for the combined IEEE 754
> > total order and nan_count approach:
> > - Java: https://github.com/apache/parquet-java/pull/3393
> > - Rust: https://github.com/apache/arrow-rs/pull/9619 (Thanks Ed!)
> >
> > The spec PR is available here:
> > https://github.com/apache/parquet-format/pull/514
> >
> > We have also added a test file to parquet-testing for interoperability
> > tests, which has been verified by both parquet-java and arrow-rs:
> > https://github.com/apache/parquet-testing/pull/104
> >
> > I'd like to encourage everyone to take another look at the current
> > proposal and implementation. Any feedback or suggestions are welcome. If
> > there are no further objections, I will move forward with a formal vote.
> >
> > Best regards,
> > Gang
> >
> > On Mon, Mar 16, 2026 at 11:30 AM Gang Wu <[email protected]> wrote:
> >>
> >> Thanks Zehua! Really appreciate it!
> >>
> >> On Mon, Mar 16, 2026 at 10:40 AM Zehua Zou <[email protected]> wrote:
> >>>
> >>> Hello Gang and others,
> >>>
> >>> I am willing to implement the C++ POC.
> >>>
> >>>
> >>>
> >>> > On Mar 14, 2026, at 23:56, Gang Wu <[email protected]> wrote:
> >>> >
> >>> > Update:
> >>> >
> >>> > Java POC is ready for IEEE 754 column order combined with nan_count:
> >>> > https://github.com/apache/parquet-java/pull/3393
> >>> >
> >>> > The spec PR has been updated earlier to address all comments:
> >>> > https://github.com/apache/parquet-format/pull/514
> >>> >
> >>> > Really appreciate any review and feedback!
> >>> >
> >>> > Best,
> >>> > Gang
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > On Wed, Feb 11, 2026 at 4:24 PM Gang Wu <[email protected]> wrote:
> >>> >
> >>> >> Hello all,
> >>> >>
> >>> >> I'm reaching out to help drive this long-running discussion (nearly
> >>> >> three years now) towards a final resolution. With Jan's authorization,
> >>> >> and my sincere thanks for his sustained effort, I want to help push
> >>> >> this issue to the finish line.
> >>> >>
> >>> >> To recap, we have two primary proposals on how to handle NaNs in
> >>> >> statistics and column indexes:
> >>> >>
> >>> >> * IEEE 754 Total Order [1]: Proposes adding a new column order
> >>> >> IEEE754TotalOrder for FLOAT/DOUBLE/FLOAT16. This provides a defined
> >>> >> ordering for every float bit pattern, including NaNs and -0/+0,
> >>> >> allowing writers to include NaNs in min/max and removing ambiguity for
> >>> >> only-NaN pages.
> >>> >> * Combined Approach [2]: Proposes adopting the IEEE 754 total order
> >>> >> alongside explicit nan_count(s) fields. This approach mandates the
> >>> >> nan_count(s) when the new order is used and clarifies how to handle
> >>> >> edge cases from legacy writers.
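
A minimal sketch of what "a defined ordering for every float bit pattern"
can look like for DOUBLE values. It is not taken from the spec PR or from
either PoC; the names total_order_key and from_bits are hypothetical. It
uses the usual bit trick for the IEEE 754 totalOrder predicate: reinterpret
the 64 bits as a signed integer and, for values with the sign bit set, flip
the remaining 63 bits.

```python
import struct


def total_order_key(x: float) -> int:
    """Return an int whose natural order follows IEEE 754 totalOrder (doubles only)."""
    # Reinterpret the 64 bits of the double as a signed 64-bit integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # With the sign bit set, a larger magnitude must sort lower,
    # so invert the 63 non-sign bits.
    if bits < 0:
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits


def from_bits(bits: int) -> float:
    """Build a double from a raw 64-bit pattern (helper for this example only)."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


POS_NAN = from_bits(0x7FF8000000000000)  # quiet NaN, sign bit clear
NEG_NAN = from_bits(0xFFF8000000000000)  # quiet NaN, sign bit set

values = [POS_NAN, 1.0, -0.0, float("inf"), NEG_NAN, 0.0, float("-inf"), -1.0]
ordered = sorted(values, key=total_order_key)
# Order: negative NaN, -inf, -1.0, -0.0, +0.0, 1.0, +inf, positive NaN
print(ordered)
```

Under this key, -0.0 sorts strictly before +0.0, and NaNs land at either
end of the range depending on their sign bit, which is why a min/max pair
can include them without ambiguity.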
> >>> >>
> >>> >> Based on the recent comments, it appears the combined approach [2] is
> >>> >> gaining consensus, although the IEEE 754 total order [1] still has
> >>> >> strong advocates.
> >>> >>
> >>> >> I agree with the sentiment that technical direction should be set by
> >>> >> consensus, not a vote. To that end, I'd like to solicit further
> >>> >> feedback specifically on the combined approach [2] to see if we can
> >>> >> achieve the necessary consensus to move forward now.
> >>> >>
> >>> >> I recall that the total order proposal [1] already has three PoC
> >>> >> implementations. For the combined approach [2], I can draft a PoC in
> >>> >> parquet-java, but to meet the two-implementation requirement, we would
> >>> >> need one more contributor to step up.
> >>> >>
> >>> >> [1] https://github.com/apache/parquet-format/pull/221
> >>> >> [2] https://github.com/apache/parquet-format/pull/514
> >>> >>
> >>> >> Best,
> >>> >> Gang
> >>> >>
> >>> >>
> >>> >> On Sat, Aug 16, 2025 at 1:59 AM Gijs Burghoorn <[email protected]> wrote:
> >>> >>>
> >>> >>> Hello Jan,
> >>> >>>
> >>> >>> Thank you for pushing this through. Apart from some smaller nits, we
> >>> >>> also really like the current proposal.
> >>> >>>
> >>> >>> Thanks,
> >>> >>> Gijs
> >>> >>>
> >>> >>> On Fri, Aug 15, 2025 at 3:33 PM Andrew Lamb <[email protected]> wrote:
> >>> >>>
> >>> >>>> I have started organizing a project [1] in arrow-rs's Parquet reader
> >>> >>>> to try to implement this proposal.
> >>> >>>>
> >>> >>>> Hopefully that can be one of the two open source implementations
> >>> >>>> needed.
> >>> >>>>
> >>> >>>> Thanks again for helping drive this along,
> >>> >>>> Andrew
> >>> >>>>
> >>> >>>> [1] https://github.com/apache/arrow-rs/issues/8156
> >>> >>>>
> >>> >>>> On Wed, Aug 13, 2025 at 5:39 AM Jan Finis <[email protected]> wrote:
> >>> >>>>
> >>> >>>>> I have now tagged
> >>> >>>>> <https://github.com/apache/parquet-format/pull/514#issuecomment-3182978173>
> >>> >>>>> the people that argued for total order in the initial PR. Let's see
> >>> >>>>> their response.
> >>> >>>>>
> >>> >>>>> If I understand the adoption process correctly, the next hurdle to
> >>> >>>>> getting this adopted is two open source (!) implementations proving
> >>> >>>>> its feasibility. We already had that for IEEE total order. If we
> >>> >>>>> prefer the solution with nan counts, we'll need it there as well. I
> >>> >>>>> myself work on a proprietary implementation, so I'm counting on
> >>> >>>>> others here :). Be prepared though, this will likely take months
> >>> >>>>> unless the interest in this topic has risen to a point where people
> >>> >>>>> are eager to jump on the implementation right away.
> >>> >>>>>
> >>> >>>>> So, I guess it will take some months of soaking time before any
> >>> >>>>> formal vote can be done (given that we reach consensus that this is
> >>> >>>>> what we want and we find people for the implementations).
> >>> >>>>>
> >>> >>>>> Cheers,
> >>> >>>>> Jan
> >>> >>>>>
> >>> >>>>> On Wed, Aug 13, 2025 at 01:18, Ryan Blue <[email protected]> wrote:
> >>> >>>>>
> >>> >>>>>> Thanks, Jan. I also went through the combined proposal and it looks
> >>> >>>>>> mostly good to me.
> >>> >>>>>>
> >>> >>>>>>> First of all, to make it quick: Yes, the solution of having
> >>> >>>>>>> nan_counts *and* total order, which was brought up multiple times,
> >>> >>>>>>> does work and solves more cases than just either of both.
> >>> >>>>>>
> >>> >>>>>> Great, then we have a solution for both filtering use cases and for
> >>> >>>>>> moving ahead with total order. And thanks to Andrew for suggesting
> >>> >>>>>> this as well on the second PR. I think this also looks like
> >>> >>>>>> something that Orson is okay with, given his comments on the latest
> >>> >>>>>> PR.
> >>> >>>>>>
> >>> >>>>>> Is there anyone against the combined approach? I don't see a big
> >>> >>>>>> downside for anyone. It is compatible with previous stats rules, has
> >>> >>>>>> a NaN count, and allows using either type-specific order or total
> >>> >>>>>> order.
> >>> >>>>>>
> >>> >>>>>> Assuming that this satisfies the big objections, I think we should
> >>> >>>>>> wait for a few days to make sure everyone has time to check out the
> >>> >>>>>> new PR and then vote to adopt it.
> >>> >>>>>>
> >>> >>>>>> Ryan
> >>> >>>>>>
> >>> >>>>>> On Mon, Aug 11, 2025 at 6:03 AM Andrew Lamb <[email protected]> wrote:
> >>> >>>>>>
> >>> >>>>>>> Thank you Jan -- I read through the new combined proposal, and I
> >>> >>>>>>> thought it looks good and addresses the feedback so far. I left
> >>> >>>>>>> some small style suggestions, but nothing that is required from my
> >>> >>>>>>> perspective.
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>> On Sat, Aug 9, 2025 at 9:07 AM Jan Finis <[email protected]> wrote:
> >>> >>>>>>>
> >>> >>>>>>>> Hey Ryan,
> >>> >>>>>>>>
> >>> >>>>>>>> Thanks for chiming in. First of all, to make it quick: Yes, the
> >>> >>>>>>>> solution of having nan_counts *and* total order, which was brought
> >>> >>>>>>>> up multiple times, does work and solves more cases than just
> >>> >>>>>>>> either of both.
> >>> >>>>>>>>
> >>> >>>>>>>>> I strongly prefer continuing to discuss the merits of these
> >>> >>>>>>>>> approaches rather than trying to decide with a vote.
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> In theory, I agree that it isn't good to silence a discussion by
> >>> >>>>>>>> just voting for one possible solution and technical issues should
> >>> >>>>>>>> be discussed. However, please note that we have been circling on
> >>> >>>>>>>> this for over two years now, including an extended discussion
> >>> >>>>>>>> that brought up all arguments multiple times. This is in stark
> >>> >>>>>>>> contrast to the speed with which you guys work on the Iceberg
> >>> >>>>>>>> spec, for example. There, you also do not discuss the merits of
> >>> >>>>>>>> various solutions for multiple years. You just pick one and merge
> >>> >>>>>>>> it after a *reasonable* time of discussion. If you had the speed
> >>> >>>>>>>> we currently have here, nothing would get done. Thus, I see this
> >>> >>>>>>>> as a clear case of *"the perfect is the enemy of the good"*. Yes,
> >>> >>>>>>>> we can continue looking for the perfect solution, but that will
> >>> >>>>>>>> likely lead to keeping us at the status quo, which is the worst
> >>> >>>>>>>> of them all.
> >>> >>>>>>>>
> >>> >>>>>>>> That being said, I'm also happy to create a PR which does both
> >>> >>>>>>>> total order and NaN counts; after all, I just want the issue
> >>> >>>>>>>> solved and all these solutions are better than the status quo.
> >>> >>>>>>>>
> >>> >>>>>>>> *As this was now suggested by at least three people, I guess it's
> >>> >>>>>>>> worth doing, so here you go:
> >>> >>>>>>>> https://github.com/apache/parquet-format/pull/514*
> >>> >>>>>>>>
> >>> >>>>>>>> With this, we should have PRs covering most of the solution space.
> >>> >>>>>>>> (I'm refusing to create a PR with negative and positive
> >>> >>>>>>>> nan_counts; nan_counts + total order has to suffice; the
> >>> >>>>>>>> complexity madness has to stop somewhere.)
> >>> >>>>>>>> I still believe that there were a number of people who already
> >>> >>>>>>>> found nan_counts too complex and therefore wanted IEEE total
> >>> >>>>>>>> order, and these people may not like taking on extra complexity,
> >>> >>>>>>>> but let's see, maybe some have also changed their opinion in the
> >>> >>>>>>>> meantime.
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> *Given all this, we can also first do an informal vote where
> >>> >>>>>>>> everyone can vote for which of the three would be their favorite.
> >>> >>>>>>>> Maybe a clear favorite will emerge and then we can vote on this
> >>> >>>>>>>> one.*
> >>> >>>>>>>>
> >>> >>>>>>>> But of course, we can also take some weeks to discuss the three
> >>> >>>>>>>> solutions, now that we have PRs for all of them. I just hope this
> >>> >>>>>>>> won't make us continue for another 2 years, or an infinite
> >>> >>>>>>>> stalemate where each solution is vetoed by a PMC member.
> >>> >>>>>>>> (Sorry for becoming a bit cynical here; I have just spent way too
> >>> >>>>>>>> much time of my life with double statistics at this point ;) ...)
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> Cheers,
> >>> >>>>>>>> Jan
> >>> >>>>>>>>
> >>> >>>>>>>> On Fri, Aug 8, 2025 at 23:38, Ryan Blue <[email protected]> wrote:
> >>> >>>>>>>>
> >>> >>>>>>>>> Regarding the process for this, I strongly prefer continuing to
> >>> >>>>>>>>> discuss the merits of these approaches rather than trying to
> >>> >>>>>>>>> decide with a vote. I don't think it is a good practice to use a
> >>> >>>>>>>>> vote to decide on a technical direction. There are very few
> >>> >>>>>>>>> situations that warrant it and I don't think that this is one of
> >>> >>>>>>>>> them. While this issue has been open for a long time, that
> >>> >>>>>>>>> appears to be the result of it not being anyone's top priority
> >>> >>>>>>>>> rather than indecision.
> >>> >>>>>>>>>
> >>> >>>>>>>>> For the technical merits of these approaches, I think that we
> >>> >>>>>>>>> can find a middle ground. I agree with Jan that when working
> >>> >>>>>>>>> with sorted values, we need to know how NaN values were handled
> >>> >>>>>>>>> and that requires using a well-defined order that includes NaN
> >>> >>>>>>>>> and its variations (because we should not normalize). Using NaN
> >>> >>>>>>>>> count is not sufficient for ordering rows.
> >>> >>>>>>>>>
> >>> >>>>>>>>> Gijs also brings up good points about how NaN values show up in
> >>> >>>>>>>>> actual datasets: not just when used in place of null, but also
> >>> >>>>>>>>> as the result of normal calculations on abnormal data, like
> >>> >>>>>>>>> `sqrt(-4.0)` or `log(-1.0)`. Both of those present problems when
> >>> >>>>>>>>> mixed with valid data because of the stats "poisoning" problem,
> >>> >>>>>>>>> where the range of valid data is usable until a single NaN is
> >>> >>>>>>>>> mixed in.
> >>> >>>>>>>>>
> >>> >>>>>>>>> Another issue is that NaN is error-prone because any "regular"
> >>> >>>>>>>>> ordering comparison involving NaN is false:
> >>> >>>>>>>>> ```
> >>> >>>>>>>>> Math.log(-1.0) >= 2 => FALSE
> >>> >>>>>>>>> Math.log(-1.0) < 2 => FALSE
> >>> >>>>>>>>> 2 > Math.log(-1.0) => FALSE
> >>> >>>>>>>>> ```
> >>> >>>>>>>>>
> >>> >>>>>>>>> As a result, Iceberg doesn't trust NaN values as either lower or
> >>> >>>>>>>>> upper bounds because we don't want to go back to the code that
> >>> >>>>>>>>> produced the value to see what the comparison order was to
> >>> >>>>>>>>> determine whether NaN values go before or after others.
> >>> >>>>>>>>>
> >>> >>>>>>>>> Total order solves the second issue in theory, but regular
> >>> >>>>>>>>> comparison is prevalent and not obvious to developers. And it
> >>> >>>>>>>>> also doesn't help when NaN is used instead of null. So using
> >>> >>>>>>>>> total order is not sufficient for data skipping.
> >>> >>>>>>>>>
> >>> >>>>>>>>> I think the right compromise is to use `min`, `max`, and
> >>> >>>>>>>>> `nan_count` for data skipping stats (where min and max cannot be
> >>> >>>>>>>>> NaN) and total ordering for sorting values. That satisfies the
> >>> >>>>>>>>> data skipping use cases and also gives us an ordering of
> >>> >>>>>>>>> unaltered values that we can reason about.
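
A minimal sketch of the data-skipping half of this compromise, assuming
statistics where min and max never hold a NaN and NaNs are only counted.
The class PageStats and the helper functions are hypothetical names for
illustration; they are not the Parquet Statistics struct or any library
API, and the predicate uses regular comparison (NaN never matches `>`).

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class PageStats:
    min_value: Optional[float]  # None when the page has no non-NaN values
    max_value: Optional[float]
    nan_count: int


def collect_stats(values: Iterable[float]) -> PageStats:
    """Compute min/max over non-NaN values only, counting NaNs separately."""
    lo: Optional[float] = None
    hi: Optional[float] = None
    nans = 0
    for v in values:
        if math.isnan(v):
            nans += 1  # a NaN never "poisons" the min/max range
            continue
        if lo is None or v < lo:
            lo = v
        if hi is None or v > hi:
            hi = v
    return PageStats(lo, hi, nans)


def may_match_greater_than(stats: PageStats, threshold: float) -> bool:
    """False means the page can be skipped for the predicate `x > threshold`."""
    return stats.max_value is not None and stats.max_value > threshold


def may_contain_nan(stats: PageStats) -> bool:
    """NaN lookups are answered from nan_count alone."""
    return stats.nan_count > 0


mixed = collect_stats([1.0, float("nan"), 3.0])
only_nan = collect_stats([float("nan"), float("nan")])
print(mixed, only_nan)
```

Here a single NaN in a page leaves the [1.0, 3.0] range usable for
skipping, and an only-NaN page is recognizable from its missing bounds
plus a non-zero count.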
> >>> >>>>>>>>>
> >>> >>>>>>>>> Does anyone think that doesn't work?
> >>> >>>>>>>>>
> >>> >>>>>>>>> Ryan
> >>> >>>>>>>>>
> >>> >>>>>>>>> On Fri, Aug 1, 2025 at 8:57 AM Gang Wu <[email protected]> wrote:
> >>> >>>>>>>>>
> >>> >>>>>>>>>> Thanks Jan for your endless effort on this!
> >>> >>>>>>>>>>
> >>> >>>>>>>>>> I'm in favor of simplicity and generality. I think we have
> >>> >>>>>>>>>> already debated `nan_count` a lot in [1], and [2] is the
> >>> >>>>>>>>>> reflection of those discussions. Therefore, I am inclined to
> >>> >>>>>>>>>> start a vote for [2] unless there is a significantly better
> >>> >>>>>>>>>> proposal.
> >>> >>>>>>>>>>
> >>> >>>>>>>>>> I would suggest that everyone interested in this discussion
> >>> >>>>>>>>>> attend the scheduled sync on Aug 6th (detailed below) to spread
> >>> >>>>>>>>>> the word to the broader community. If we can get a consensus on
> >>> >>>>>>>>>> [2], I can help start the vote and move forward.
> >>> >>>>>>>>>>
> >>> >>>>>>>>>> *Apache Parquet Community Sync, Wednesday, August 6 · 10:00 – 11:00am*
> >>> >>>>>>>>>> *Time zone: America/Los_Angeles*
> >>> >>>>>>>>>> *Google Meet joining info, video call link:
> >>> >>>>>>>>>> https://meet.google.com/bhe-rvan-qjk*
> >>> >>>>>>>>>>
> >>> >>>>>>>>>> [1] https://github.com/apache/parquet-format/pull/196
> >>> >>>>>>>>>> [2] https://github.com/apache/parquet-format/pull/221
> >>> >>>>>>>>>>
> >>> >>>>>>>>>> Best,
> >>> >>>>>>>>>> Gang
> >>> >>>>>>>>>>
> >>> >>>>>>>>>>
> >>> >>>>>>>>>> On Fri, Aug 1, 2025 at 6:16 PM Jan Finis <[email protected]> wrote:
> >>> >>>>>>>>>>
> >>> >>>>>>>>>>> Hi Gijs,
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>> Thank you for bringing up concrete points, I'm happy to discuss
> >>> >>>>>>>>>>> them in detail.
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>>> NaNs are less common in the SQL world than in the DataFrame
> >>> >>>>>>>>>>>> world where NaNs were used for a long time to represent
> >>> >>>>>>>>>>>> missing values.
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>> You could transcode between NULL and NaN when reading from and
> >>> >>>>>>>>>>> writing to Parquet. You basically mention yourself that NaNs
> >>> >>>>>>>>>>> were used for missing values, i.e., what is commonly a NULL,
> >>> >>>>>>>>>>> which wasn't available. So, semantically, transcoding to NULL
> >>> >>>>>>>>>>> would even be the sane thing to do. Yes, that will cost you
> >>> >>>>>>>>>>> some cycles, but should be a rather lightweight operation in
> >>> >>>>>>>>>>> comparison to most other operations, so I would argue that it
> >>> >>>>>>>>>>> won't totally ruin your performance. Similarly, why should
> >>> >>>>>>>>>>> Parquet play along with a "hack" that was done in other
> >>> >>>>>>>>>>> frameworks due to shortcomings of those frameworks? So from a
> >>> >>>>>>>>>>> philosophical point of view, I think supporting NaNs better is
> >>> >>>>>>>>>>> the wrong thing to do. Rather, we should be a forcing function
> >>> >>>>>>>>>>> to align others to better behavior, so applying a bit of force
> >>> >>>>>>>>>>> might in the long run make people use NULLs also in
> >>> >>>>>>>>>>> DataFrames.
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>> Of course, your argument also goes into the direction of
> >>> >>>>>>>>>>> pragmatism: If a large part of the data science world uses
> >>> >>>>>>>>>>> NaNs to encode missing values, then maybe Parquet should
> >>> >>>>>>>>>>> accept this de-facto standard rather than fighting it. That is
> >>> >>>>>>>>>>> indeed a valid point. The weight of it is debatable and my
> >>> >>>>>>>>>>> personal conclusion is that it's still not worth it, as you
> >>> >>>>>>>>>>> can transcode between NULLs and NaNs, but I do agree with its
> >>> >>>>>>>>>>> validity.
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>>> Since the proposal phrases it as a goal to work "regardless
> >>> >>>>>>>>>>>> of how they order NaN w.r.t. other values" this statement
> >>> >>>>>>>>>>>> feels out-of-place to me. Most hardware and most people don't
> >>> >>>>>>>>>>>> care about total ordering and needing to take it into account
> >>> >>>>>>>>>>>> while filtering using statistics seems like preferring the
> >>> >>>>>>>>>>>> special case instead of the common case. Almost no one
> >>> >>>>>>>>>>>> filters for specific NaN value bit-patterns. SQL engines that
> >>> >>>>>>>>>>>> don't have IEEE total ordering as their default ordering for
> >>> >>>>>>>>>>>> floats will also need to do more special handling for this.
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>> I disagree with the conclusion this statement draws. The
> >>> >>>>>>>>>>> current behavior, and nan_counts without total ordering, pose
> >>> >>>>>>>>>>> a real problem here, even for engines that don't care about
> >>> >>>>>>>>>>> bit patterns. I do agree that most database engines, including
> >>> >>>>>>>>>>> the one I'm working on, do not care about bit patterns and/or
> >>> >>>>>>>>>>> sign bits. However, how can our database engine know whether
> >>> >>>>>>>>>>> the writer of a Parquet file saw it the same way? It can't.
> >>> >>>>>>>>>>> Therefore, it cannot know whether a writer, for example,
> >>> >>>>>>>>>>> ordered NaNs before or after all other numbers, or maybe
> >>> >>>>>>>>>>> ordered them by sign bit. So, if our database engine now sees
> >>> >>>>>>>>>>> a float column in sorting columns, it cannot apply any
> >>> >>>>>>>>>>> optimization without a lot of special casing, as it doesn't
> >>> >>>>>>>>>>> know whether NaNs will be before all other values, after all
> >>> >>>>>>>>>>> other values, or maybe both, depending on sign bit. It could
> >>> >>>>>>>>>>> apply contrived logic that tries to infer where NaNs were
> >>> >>>>>>>>>>> placed from the NaN counts of the first and last page, but
> >>> >>>>>>>>>>> doing so will be a lot of ugly code that also feels to be in
> >>> >>>>>>>>>>> the wrong place. I.e., I don't want to need to load pages or
> >>> >>>>>>>>>>> the page index, just to reason about a sort order.
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>>> SQL engines that don't have IEEE total ordering as their
> >>> >>>>>>>>>>>> default ordering for floats will also need to do more special
> >>> >>>>>>>>>>>> handling for this.
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>> This code, which I would indeed need to write for our engine,
> >>> >>>>>>>>>>> is comparatively trivial. Simply choose the largest possible
> >>> >>>>>>>>>>> bit pattern as comparison for upper bounds filtering for NaN,
> >>> >>>>>>>>>>> and the smallest possible bit pattern for lower bounds. It's
> >>> >>>>>>>>>>> not more than a few lines of code that check whether a filter
> >>> >>>>>>>>>>> is NaN and then replace its value with the highest/lowest NaN
> >>> >>>>>>>>>>> bit pattern. It is similarly trivial to the special casing I
> >>> >>>>>>>>>>> need to do with nan_counts, and it is way more trivial than
> >>> >>>>>>>>>>> the extra code I would need to write for sorting columns, as
> >>> >>>>>>>>>>> depicted above.
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>>> From a Polars perspective, having a `nan_count` and defining
> >>> >>>>>>>>>>>> what happens to the `min` and `max` statistics when a page
> >>> >>>>>>>>>>>> contains only NaNs is enough to allow for all predicate
> >>> >>>>>>>>>>>> filtering. I think, but correct me if I am wrong, this is
> >>> >>>>>>>>>>>> also enough for all SQL engines that don't use total
> >>> >>>>>>>>>>>> ordering.
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>> It's not fully enough, as depicted above. Sorting columns would
> >>> >>>>>>>>>>> still not work properly.
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>>> As for ways forward, I propose merging the `nan_count` and
> >>> >>>>>>>>>>>> `sort ordering` proposals into one to make one proposal
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>> Note that the initial reason for proposing IEEE total order
> >>> >>>>>>>>>>> was that people in the discussion threads found nan_counts to
> >>> >>>>>>>>>>> be too complex and too much of an undeserving special case
> >>> >>>>>>>>>>> (re-read the discussion in the initial PR
> >>> >>>>>>>>>>> <https://github.com/apache/parquet-format/pull/196> to see the
> >>> >>>>>>>>>>> rationales).
> >>> >>>>>>>>>>> So merging both together would go totally against the spirit
> >>> >>>>>>>>>>> of why IEEE total order was proposed. While it has further
> >>> >>>>>>>>>>> upsides, the main reason was indeed to *not have* nan_counts.
> >>> >>>>>>>>>>> If now the proposal would even go to positive and negative nan
> >>> >>>>>>>>>>> counts (i.e., even more complexity), this would go 180 degrees
> >>> >>>>>>>>>>> into the opposite direction of why people wanted total order
> >>> >>>>>>>>>>> in the first place.
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>> Cheers,
> >>> >>>>>>>>>>> Jan
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>> On Thu, Jul 31, 2025 at 23:23, Gijs Burghoorn
> >>> >>>>>>>>>>> <[email protected]> wrote:
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>>> Hello Jan and others,
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>> First, let me preface by saying I am quite new here. So I
> >>> >>>>>>>>>>>> apologize if there is some other better way to bring up these
> >>> >>>>>>>>>>>> concerns. I understand it is very annoying to come in at the
> >>> >>>>>>>>>>>> 11th hour and start bringing up a bunch of concerns, but I
> >>> >>>>>>>>>>>> would also like this to be done right. A colleague of mine
> >>> >>>>>>>>>>>> brought up some concerns and alternative approaches in the
> >>> >>>>>>>>>>>> GitHub thread; I will file some of the concerns here as a
> >>> >>>>>>>>>>>> response.
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>>> Treating NaNs so specially is giving them attention they
> >>> >>>>>>>>>>>>> don't deserve. Most data sets do not contain NaNs. If a use
> >>> >>>>>>>>>>>>> case really requires them and needs filtering to ignore
> >>> >>>>>>>>>>>>> them, they can store NULL instead, or encode them
> >>> >>>>>>>>>>>>> differently. I would prefer the average case over the
> >>> >>>>>>>>>>>>> special case here.
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>> NaNs are less common in the SQL world than in the DataFrame
> >>> >>>>>>>>>>>> world where NaNs were used for a long time to represent
> >>> >>>>>>>>>>>> missing values. They still exist with different canonical
> >>> >>>>>>>>>>>> representations and different sign bits. I agree it might not
> >>> >>>>>>>>>>>> be correct semantically, but sadly that is the world we deal
> >>> >>>>>>>>>>>> with. NumPy and Numba do not have missing data functionality;
> >>> >>>>>>>>>>>> people use NaNs there, and people definitely use that in
> >>> >>>>>>>>>>>> their analytical dataflows. Another point that was brought up
> >>> >>>>>>>>>>>> in the GH discussion was "what about infinity? You could
> >>> >>>>>>>>>>>> argue that having infinity in statistics is similarly
> >>> >>>>>>>>>>>> unuseful as it's too wide of a bound". I would argue that
> >>> >>>>>>>>>>>> infinity is very different as there is no discussion on what
> >>> >>>>>>>>>>>> the ordering or pattern of infinity is. Everyone agrees that
> >>> >>>>>>>>>>>> `min(1.0, inf, -inf) == -inf` and each infinity only has a
> >>> >>>>>>>>>>>> single bit pattern.
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>>> It gives a defined order to every bit pattern and thus
> >>> >>>>>>>>>>>>> yields a total order, mathematically speaking, which has
> >>> >>>>>>>>>>>>> value by itself. With NaN counts, it was still undefined how
> >>> >>>>>>>>>>>>> different bit patterns of NaNs were supposed to be ordered,
> >>> >>>>>>>>>>>>> whether NaN was allowed to have a sign bit, etc., risking
> >>> >>>>>>>>>>>>> that different engines could come to different results while
> >>> >>>>>>>>>>>>> filtering or sorting values within a file.
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>> Since the proposal phrases it as a goal to work "regardless of
> >>> >>>>>>>>>>>> how they order NaN w.r.t. other values", this statement feels
> >>> >>>>>>>>>>>> out of place to me. Most hardware and most people don't care
> >>> >>>>>>>>>>>> about total ordering, and needing to take it into account
> >>> >>>>>>>>>>>> while filtering using statistics seems like preferring the
> >>> >>>>>>>>>>>> special case instead of the common case. Almost no one filters
> >>> >>>>>>>>>>>> for specific NaN value bit-patterns. SQL engines that don't
> >>> >>>>>>>>>>>> have IEEE total ordering as their default ordering for floats
> >>> >>>>>>>>>>>> will also need to do more special handling for this.
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>> I also agree with my colleague that doing an approach that is
> >>> >>>>>>>>>>>> 50% of the way there will make the barrier to improving it to
> >>> >>>>>>>>>>>> what it actually should be later on much higher.
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>> As for ways forward, I propose merging the `nan_count` and
> >>> >>>>>>>>>>>> `sort ordering` proposals into a single proposal, as they are
> >>> >>>>>>>>>>>> linked together, and moving forward with one without knowing
> >>> >>>>>>>>>>>> what will happen to the other seems unwise. From a Polars
> >>> >>>>>>>>>>>> perspective, having a `nan_count` and defining what happens to
> >>> >>>>>>>>>>>> the `min` and `max` statistics when a page contains only NaNs
> >>> >>>>>>>>>>>> is enough to allow for all predicate filtering. I think, but
> >>> >>>>>>>>>>>> correct me if I am wrong, this is also enough for all SQL
> >>> >>>>>>>>>>>> engines that don't use total ordering. But if you want to be
> >>> >>>>>>>>>>>> impartial to the engine's floating-point ordering and allow
> >>> >>>>>>>>>>>> engines with total ordering to do inequality filters when
> >>> >>>>>>>>>>>> `nan_count > 0`, you would need a `positive_nan_count` and a
> >>> >>>>>>>>>>>> `negative_nan_count`. I understand the downside with Thrift
> >>> >>>>>>>>>>>> complexity, but introducing another sort order is also adding
> >>> >>>>>>>>>>>> complexity, just in a different place.
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>> I would really like to see this move forward, so I hope these
> >>> >>>>>>>>>>>> concerns help move it forward towards a solution that works
> >>> >>>>>>>>>>>> for everyone.
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>> Kind regards,
> >>> >>>>>>>>>>>> Gijs
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>> On Thu, Jul 31, 2025 at 7:46 PM Andrew Lamb <[email protected]> wrote:
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>>> I would also be in favor of starting a vote
> >>> >>>>>>>>>>>>>
> >>> >>>>>>>>>>>>> On Thu, Jul 31, 2025 at 11:23 AM Jan Finis <[email protected]> wrote:
> >>> >>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>> As the author of both the IEEE754 total order
> >>> >>>>>>>>>>>>>> <https://github.com/apache/parquet-format/pull/221> PR and the
> >>> >>>>>>>>>>>>>> earlier PR that basically proposed `nan_count`
> >>> >>>>>>>>>>>>>> <https://github.com/apache/parquet-format/pull/196>, my current
> >>> >>>>>>>>>>>>>> vote would be for IEEE754 total order.
> >>> >>>>>>>>>>>>>> Consequently, I would like to request a formal vote for the PR
> >>> >>>>>>>>>>>>>> introducing IEEE754 total order
> >>> >>>>>>>>>>>>>> (https://github.com/apache/parquet-format/pull/221), if that is
> >>> >>>>>>>>>>>>>> possible.
> >>> >>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>> My Rationales:
> >>> >>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>  - It's conceptually simpler. It's easier to explain. It's
> >>> >>>>>>>>>>>>>>    based on an IEEE-standardized order predicate.
> >>> >>>>>>>>>>>>>>  - There are already multiple implementations showing
> >>> >>>>>>>>>>>>>>    feasibility. This will likely make the adoption quicker.
> >>> >>>>>>>>>>>>>>  - It gives a defined order to every bit pattern and thus
> >>> >>>>>>>>>>>>>>    yields a total order, mathematically speaking, which has
> >>> >>>>>>>>>>>>>>    value by itself. With NaN counts, it was still undefined
> >>> >>>>>>>>>>>>>>    how different bit patterns of NaNs were supposed to be
> >>> >>>>>>>>>>>>>>    ordered, whether NaN was allowed to have a sign bit, etc.,
> >>> >>>>>>>>>>>>>>    risking that different engines could come to different
> >>> >>>>>>>>>>>>>>    results while filtering or sorting values within a file.
> >>> >>>>>>>>>>>>>>  - It also solves sort order completely. With nan_counts only,
> >>> >>>>>>>>>>>>>>    it is still undefined whether NaNs should be sorted before
> >>> >>>>>>>>>>>>>>    or after all values (or both, depending on sign bit), so
> >>> >>>>>>>>>>>>>>    any file including NaNs could not really leverage sort
> >>> >>>>>>>>>>>>>>    order without being ambiguous.
> >>> >>>>>>>>>>>>>>  - It's less complex in Thrift. Having fields that only apply
> >>> >>>>>>>>>>>>>>    to a handful of data types is somewhat odd. If every type
> >>> >>>>>>>>>>>>>>    did this, we would have a plethora of non-generic fields in
> >>> >>>>>>>>>>>>>>    Thrift.
> >>> >>>>>>>>>>>>>>  - Treating NaNs so specially is giving them attention they
> >>> >>>>>>>>>>>>>>    don't deserve. Most data sets do not contain NaNs. If a use
> >>> >>>>>>>>>>>>>>    case really requires them and needs filtering to ignore
> >>> >>>>>>>>>>>>>>    them, they can store NULL instead, or encode them
> >>> >>>>>>>>>>>>>>    differently. I would prefer the average case over the
> >>> >>>>>>>>>>>>>>    special case here.
> >>> >>>>>>>>>>>>>>  - The majority of the people discussing this so far seem to
> >>> >>>>>>>>>>>>>>    favor total order.
> >>> >>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>> Cheers,
> >>> >>>>>>>>>>>>>> Jan
> >>> >>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>> On Sat, Jul 26, 2025 at 5:38 PM Gang Wu <[email protected]> wrote:
> >>> >>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>> Hi all,
> >>> >>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>> As this discussion has been open for more than two years, I’d
> >>> >>>>>>>>>>>>>>> like to bump up this thread again to update the progress and
> >>> >>>>>>>>>>>>>>> collect feedback.
> >>> >>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>> *Background*
> >>> >>>>>>>>>>>>>>> • Today Parquet’s min/max stats and page index omit NaNs
> >>> >>>>>>>>>>>>>>>   entirely.
> >>> >>>>>>>>>>>>>>> • Engines can’t safely prune floating-point values because
> >>> >>>>>>>>>>>>>>>   they know nothing about NaNs.
> >>> >>>>>>>>>>>>>>> • Column index is disabled if any page contains only NaNs.
> >>> >>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>> There are two active proposals as below:
> >>> >>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>> *Proposal A - IEEE754TotalOrder* (from the PR [1])
> >>> >>>>>>>>>>>>>>> • Define a new ColumnOrder to include +0, –0 and all NaN
> >>> >>>>>>>>>>>>>>>   bit‐patterns.
> >>> >>>>>>>>>>>>>>> • Stats and column index store NaNs if they appear.
> >>> >>>>>>>>>>>>>>> • Three PoC impls are ready: arrow-rs [2], duckdb [3] and
> >>> >>>>>>>>>>>>>>>   parquet-java [4].
> >>> >>>>>>>>>>>>>>> • For more context of this approach, please refer to
> >>> >>>>>>>>>>>>>>>   discussion in [5].
> >>> >>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>> *Proposal B - add nan_count* (from a comment [6] to [1])
> >>> >>>>>>>>>>>>>>> • Add `nan_count` to stats and a `nan_counts` list to column
> >>> >>>>>>>>>>>>>>>   index.
> >>> >>>>>>>>>>>>>>> • For all‐NaNs cases, write NaN to min/max and use nan_count
> >>> >>>>>>>>>>>>>>>   to distinguish.
> >>> >>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>> Both solutions have pros and cons but are way better than the
> >>> >>>>>>>>>>>>>>> status quo today.
> >>> >>>>>>>>>>>>>>> Please share your thoughts on the two proposals above, or
> >>> >>>>>>>>>>>>>>> maybe come up with better alternatives. We need to reach
> >>> >>>>>>>>>>>>>>> consensus on one proposal and move forward.
> >>> >>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>> [1] https://github.com/apache/parquet-format/pull/221
> >>> >>>>>>>>>>>>>>> [2] https://github.com/apache/arrow-rs/pull/7408
> >>> >>>>>>>>>>>>>>> [3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
> >>> >>>>>>>>>>>>>>> [4] https://github.com/apache/parquet-java/pull/3191
> >>> >>>>>>>>>>>>>>> [5] https://github.com/apache/parquet-format/pull/196
> >>> >>>>>>>>>>>>>>> [6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
> >>> >>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>> Best,
> >>> >>>>>>>>>>>>>>> Gang
> >>> >>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>> On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <[email protected]> wrote:
> >>> >>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> Dear contributors,
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> My PR has now gathered comments for a week and the gist of
> >>> >>>>>>>>>>>>>>>> all open issues is the question of how to encode pages/column
> >>> >>>>>>>>>>>>>>>> chunks that contain only NaNs. There are different
> >>> >>>>>>>>>>>>>>>> suggestions and I don't see one common favorite yet.
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> I have outlined three alternatives of how we can handle these
> >>> >>>>>>>>>>>>>>>> and I want us to reach a conclusion here, so I can update my
> >>> >>>>>>>>>>>>>>>> PR accordingly and move on with it. As this is my first
> >>> >>>>>>>>>>>>>>>> contribution to parquet, I don't know the decision processes
> >>> >>>>>>>>>>>>>>>> here. Do we vote? Is there a single decision maker or a group
> >>> >>>>>>>>>>>>>>>> of decision makers? *Please let me know how to come to a
> >>> >>>>>>>>>>>>>>>> conclusion here; what are the next steps?*
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> For reference, here are the three alternatives I pointed out.
> >>> >>>>>>>>>>>>>>>> You can find a detailed description of their pros and cons in
> >>> >>>>>>>>>>>>>>>> my comment:
> >>> >>>>>>>>>>>>>>>> https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> 1. My initial proposal, i.e., encoding only-NaN pages by
> >>> >>>>>>>>>>>>>>>>    min=max=NaN.
> >>> >>>>>>>>>>>>>>>> 2. Adding `num_values` to the ColumnIndex, to make it
> >>> >>>>>>>>>>>>>>>>    symmetric with Statistics in pages & `ColumnMetaData` and
> >>> >>>>>>>>>>>>>>>>    to enable the computation
> >>> >>>>>>>>>>>>>>>>    `num_values - null_count - nan_count == 0`
> >>> >>>>>>>>>>>>>>>> 3. Adding a `nan_pages` bool list to the column index, which
> >>> >>>>>>>>>>>>>>>>    indicates whether a page contains only NaNs
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> Cheers
> >>> >>>>>>>>>>>>>>>> Jan Finis
> >>> >>>>>>>>>>>>>>>>
>
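
For readers following the two proposals quoted above, here is a minimal
Python sketch of what each one could mean for a reader. It is an
illustration under stated assumptions, not code from any of the linked
PRs: `total_order_key` and `page_may_match_gt` are hypothetical names,
and the `nan_is_largest` flag stands in for whatever NaN ordering an
engine uses. The first function maps a double to an integer whose
ordering matches IEEE 754 totalOrder (Proposal A). The second shows how
max plus `nan_count` could drive pruning for a `value > threshold`
predicate, assuming all-NaN pages are written with min = max = NaN
(Proposal B).

```python
import math
import struct


def total_order_key(x: float) -> int:
    """Map a double to an int that sorts like IEEE 754 totalOrder."""
    # Reinterpret the 8 bytes of the double as a signed 64-bit integer.
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    # Negative floats have the sign bit set; flipping the other 63 bits
    # makes larger magnitudes sort lower, so -NaN < -inf < -0.0 < +0.0.
    if bits < 0:
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits


def page_may_match_gt(max_value: float, nan_count: int, threshold: float,
                      nan_is_largest: bool = False) -> bool:
    """Return True if a page might hold a value matching `v > threshold`."""
    if nan_is_largest and nan_count > 0:
        return True   # assumed engine semantics: NaN sorts above every number
    if math.isnan(max_value):
        return False  # all-NaN page (min = max = NaN), nothing comparable
    return max_value > threshold


# Build a NaN with the sign bit set from its raw bit pattern.
neg_nan = struct.unpack("<d", struct.pack("<Q", 0xFFF8000000000000))[0]
values = [1.0, float("nan"), -0.0, float("inf"), 0.0, neg_nan, float("-inf")]
ordered = sorted(values, key=total_order_key)
print(ordered)
```

Sorting by `total_order_key` places the sign-bit NaN first and the
positive NaN last, with -0.0 ordered before +0.0. With `nan_count`
alone, the reader has to decide separately (here via `nan_is_largest`)
whether a NaN can satisfy the predicate, which is the ambiguity the
thread is debating.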
