Re: [DISCUSS](PARQUET-2249) Add nan_count to handle NaNs in statistics

Gang Wu Sun, 15 Mar 2026 20:31:44 -0700

Thanks Zehua! Really appreciate it!

On Mon, Mar 16, 2026 at 10:40 AM Zehua Zou <[email protected]> wrote:


> Hello Gang and others,
>
> I am willing to implement the C++ POC.
>
>
>
> > On Mar 14, 2026, at 23:56, Gang Wu <[email protected]> wrote:
> >
> > Update:
> >
> > Java POC is ready for IEEE 754 column order combined with nan_count:
> > https://github.com/apache/parquet-java/pull/3393
> >
> > The spec PR has been updated earlier to address all comments:
> > https://github.com/apache/parquet-format/pull/514
> >
> > Really appreciate any review and feedback!
> >
> > Best,
> > Gang
> >
> >
> >
> >
> > On Wed, Feb 11, 2026 at 4:24 PM Gang Wu <[email protected]> wrote:
> >
> >> Hello all,
> >>
> >> I'm reaching out to help drive this long-running discussion—nearly
> >> three years now—towards a final resolution. With Jan's authorization,
> >> and my sincere thanks for his sustained effort, I want to help push
> >> this issue to the finish line.
> >>
> >> To recap, we have two primary proposals on how to handle NaNs in
> >> statistics and column indexes:
> >>
> >> * IEEE 754 Total Order [1]: Proposes adding a new column order
> >> IEEE754TotalOrder for FLOAT/DOUBLE/FLOAT16. This provides a defined
> >> ordering for every float bit pattern, including NaNs and -0/+0,
> >> allowing writers to include NaNs in min/max and removing ambiguity for
> >> only-NaN pages.
> >> * Combined Approach [2]: Proposes adopting the IEEE 754 total order
> >> alongside explicit nan_count(s) fields. This approach mandates the
> >> nan_count(s) when the new order is used and clarifies how to handle
> >> edge cases from legacy writers.
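For readers unfamiliar with the ordering named in the first proposal, the following is a minimal Java sketch of an IEEE 754 totalOrder-style comparison for doubles. It is not taken from either PR or from parquet-java. The names `TotalOrderSketch`, `sortKey`, and `compare` are hypothetical, and the sketch assumes the common bit-manipulation formulation: reinterpret the value as a signed 64-bit integer and flip the lower 63 bits of negative values.

```java
// Illustrative only: hypothetical names, not the parquet-java implementation.
final class TotalOrderSketch {
    private TotalOrderSketch() {}

    // Maps the raw IEEE 754 bits to a long whose signed order follows totalOrder.
    static long sortKey(double value) {
        long bits = Double.doubleToRawLongBits(value); // keeps NaN sign and payload
        // When the sign bit is set, flip the lower 63 bits so that larger
        // magnitudes sort first; non-negative values are left unchanged.
        return bits ^ ((bits >> 63) >>> 1);
    }

    // Negative NaNs < -inf < negative numbers < -0 < +0 < positive numbers
    // < +inf < positive NaNs.
    static int compare(double a, double b) {
        return Long.compare(sortKey(a), sortKey(b));
    }

    public static void main(String[] args) {
        double negativeNaN = Double.longBitsToDouble(0xFFF8000000000000L);
        System.out.println(compare(-0.0, 0.0));
        System.out.println(compare(Double.POSITIVE_INFINITY, Double.NaN));
        System.out.println(compare(negativeNaN, Double.NEGATIVE_INFINITY));
    }
}
```

Under such an order a writer can put NaNs into min/max without ambiguity, because every bit pattern has exactly one position.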
> >>
> >> Based on the recent comments, it appears the combined approach [2] is
> >> gaining consensus, although the IEEE 754 total order [1] still has
> >> strong advocates.
> >>
> >> I agree with the sentiment that technical direction should be made by
> >> consensus, not a vote. To that end, I'd like to solicit further
> >> feedback specifically on the combined approach [2] to see if we can
> >> achieve the necessary consensus to move forward now.
> >>
> >> I recall that the total order proposal [1] already has three PoC
> >> implementations. For the combined approach [2], I can draft a PoC in
> >> parquet-java, but to meet the two-implementation requirement, we would
> >> need one more contributor to step up.
> >>
> >> [1] https://github.com/apache/parquet-format/pull/221
> >> [2] https://github.com/apache/parquet-format/pull/514
> >>
> >> Best,
> >> Gang
> >>
> >>
> >> On Sat, Aug 16, 2025 at 1:59 AM Gijs Burghoorn <[email protected]> wrote:
> >>>
> >>> Hello Jan,
> >>>
> >>> Thank you for pushing this through. Apart from some smaller nits, we also
> >>> really like the current proposal.
> >>>
> >>> Thanks,
> >>> Gijs
> >>>
> >>> On Fri, Aug 15, 2025 at 3:33 PM Andrew Lamb <[email protected]> wrote:
> >>>
> >>>> I have started organizing a project [1] in arrow-rs's Parquet reader to
> >>>> try and implement this proposal.
> >>>>
> >>>> Hopefully that can be one of the two open source implementations needed.
> >>>>
> >>>> Thanks again for helping drive this along,
> >>>> Andrew
> >>>>
> >>>> [1] https://github.com/apache/arrow-rs/issues/8156
> >>>>
> >>>> On Wed, Aug 13, 2025 at 5:39 AM Jan Finis <[email protected]> wrote:
> >>>>
> >>>>> I have now tagged
> >>>>> <https://github.com/apache/parquet-format/pull/514#issuecomment-3182978173>
> >>>>> the people that argued for total order in the initial PR. Let's see
> >>>>> their response.
> >>>>>
> >>>>> If I understand the adoption process correctly, the next hurdle to
> >>>>> getting this adopted is two open source (!) implementations proving its
> >>>>> feasibility. We already had that for IEEE total order. If we prefer the
> >>>>> solution with nan counts, we'll need it there as well. I myself work on
> >>>>> a proprietary implementation, so I'm counting on others here :). Be
> >>>>> prepared though, this will likely take months unless the interest in
> >>>>> this topic has risen to a point where people are eager to jump on the
> >>>>> implementation right away.
> >>>>>
> >>>>> So, I guess it will take some months of soaking time before any formal
> >>>>> vote can be done (given that we reach consensus that this is what we
> >>>>> want and we find people for the implementations).
> >>>>>
> >>>>> Cheers,
> >>>>> Jan
> >>>>>
> >>>>> On Wed, Aug 13, 2025 at 01:18, Ryan Blue <[email protected]> wrote:
> >>>>>
> >>>>>> Thanks, Jan. I also went through the combined proposal and it looks
> >>>>>> mostly good to me.
> >>>>>>
> >>>>>>> First of all, to make it quick: Yes, the solution of having
> >>>>>>> nan_counts *and* total order, which was brought up multiple times,
> >>>>>>> does work and solves more cases than either one alone.
> >>>>>>
> >>>>>> Great, then we have a solution for both filtering use cases and for
> >>>>>> moving ahead with total order. And thanks to Andrew for suggesting this
> >>>>>> as well on the second PR. It also looks like something that Orson is
> >>>>>> okay with, given his comments on the latest PR.
> >>>>>>
> >>>>>> Is there anyone against the combined approach? I don't see a big
> >>>>>> downside for anyone. It is compatible with previous stats rules, has a
> >>>>>> NaN count, and allows using either type-specific order or total order.
> >>>>>>
> >>>>>> Assuming that this satisfies the big objections, I think we should wait
> >>>>>> for a few days to make sure everyone has time to check out the new PR
> >>>>>> and then vote to adopt it.
> >>>>>>
> >>>>>> Ryan
> >>>>>>
> >>>>>> On Mon, Aug 11, 2025 at 6:03 AM Andrew Lamb <[email protected]> wrote:
> >>>>>>
> >>>>>>> Thank you Jan -- I read through the new combined proposal, and I
> >>>>>>> thought it looks good and addresses the feedback so far. I left some
> >>>>>>> small style suggestions, but nothing that is required from my
> >>>>>>> perspective.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sat, Aug 9, 2025 at 9:07 AM Jan Finis <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hey Ryan,
> >>>>>>>>
> >>>>>>>> Thanks for chiming in. First of all, to make it quick: Yes, the
> >>>>>>>> solution of having nan_counts *and* total order, which was brought up
> >>>>>>>> multiple times, does work and solves more cases than either one
> >>>>>>>> alone.
> >>>>>>>>
> >>>>>>>>> I strongly prefer continuing to discuss the merits of these
> >>>>>>>>> approaches rather than trying to decide with a vote.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> In theory, I agree that it isn't good to silence a discussion by just
> >>>>>>>> voting for one possible solution and that technical issues should be
> >>>>>>>> discussed. However, please note that we have been circling on this
> >>>>>>>> for over two years now, including an extended discussion that brought
> >>>>>>>> up all arguments multiple times. This is in stark contrast to the
> >>>>>>>> speed with which you guys work on the Iceberg spec, for example.
> >>>>>>>> There, you also do not discuss the merits of various solutions for
> >>>>>>>> multiple years. You just pick one and merge it after a *reasonable*
> >>>>>>>> time of discussion. If you had the speed we currently have here,
> >>>>>>>> nothing would get done. Thus, I see this as a clear case of *"the
> >>>>>>>> perfect is the enemy of the good"*. Yes, we can continue looking for
> >>>>>>>> the perfect solution, but that will likely lead to keeping us at the
> >>>>>>>> status quo, which is the worst of them all.
> >>>>>>>>
> >>>>>>>> That being said, I'm also happy to create a PR which does both total
> >>>>>>>> order and NaN counts; after all, I just want the issue solved and all
> >>>>>>>> these solutions are better than the status quo.
> >>>>>>>>
> >>>>>>>> *As this was now suggested by at least three people, I guess it's
> >>>>>>>> worth doing, so here you go:
> >>>>>>>> https://github.com/apache/parquet-format/pull/514*
> >>>>>>>>
> >>>>>>>> With this, we should have PRs covering most of the solution space.
> >>>>>>>> (I'm refusing to create a PR with negative and positive nan_counts;
> >>>>>>>> nan_counts + total order has to suffice; the complexity madness has
> >>>>>>>> to stop somewhere.)
> >>>>>>>> I still believe that there were a number of people who already found
> >>>>>>>> nan_counts too complex and therefore wanted IEEE total order, and
> >>>>>>>> these people may not like putting on extra complexity, but let's see,
> >>>>>>>> maybe some have also changed their opinion in the meantime.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> *Given all this, we can also first do an informal vote where everyone
> >>>>>>>> can vote for which of the three their favorite would be. Maybe a
> >>>>>>>> clear favorite will emerge and then we can vote on this one.*
> >>>>>>>>
> >>>>>>>> But of course, we can also take some weeks to discuss the three
> >>>>>>>> solutions, now that we have PRs for all of them. I just hope this
> >>>>>>>> won't make us continue for another 2 years, or an infinite stalemate
> >>>>>>>> where each solution is vetoed by a PMC member.
> >>>>>>>> (Sorry for becoming a bit cynical here; I have just spent way too
> >>>>>>>> much time of my life with double statistics at this point ;) ...)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Jan
> >>>>>>>>
> >>>>>>>> On Fri, Aug 8, 2025 at 23:38, Ryan Blue <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Regarding the process for this, I strongly prefer continuing to
> >>>>>>>>> discuss the merits of these approaches rather than trying to decide
> >>>>>>>>> with a vote. I don't think it is a good practice to use a vote to
> >>>>>>>>> decide on a technical direction. There are very few situations that
> >>>>>>>>> warrant it and I don't think that this is one of them. While this
> >>>>>>>>> issue has been open for a long time, that appears to be the result
> >>>>>>>>> of it not being anyone's top priority rather than indecision.
> >>>>>>>>>
> >>>>>>>>> For the technical merits of these approaches, I think that we can
> >>>>>>>>> find a middle ground. I agree with Jan that when working with sorted
> >>>>>>>>> values, we need to know how NaN values were handled and that
> >>>>>>>>> requires using a well-defined order that includes NaN and its
> >>>>>>>>> variations (because we should not normalize). Using NaN count is not
> >>>>>>>>> sufficient for ordering rows.
> >>>>>>>>>
> >>>>>>>>> Gijs also brings up good points about how NaN values show up in
> >>>>>>>>> actual datasets: not just when used in place of null, but also as
> >>>>>>>>> the result of normal calculations on abnormal data, like
> >>>>>>>>> `sqrt(-4.0)` or `log(-1.0)`. Both of those present problems when
> >>>>>>>>> mixed with valid data because of the stats "poisoning" problem,
> >>>>>>>>> where the range of valid data is usable until a single NaN is mixed
> >>>>>>>>> in.
> >>>>>>>>>
> >>>>>>>>> Another issue is that NaN is error-prone because "regular"
> >>>>>>>>> comparison is always false:
> >>>>>>>>> ```
> >>>>>>>>> Math.log(-1.0) >= 2 => FALSE
> >>>>>>>>> Math.log(-1.0) < 2 => FALSE
> >>>>>>>>> 2 > Math.log(-1.0) => FALSE
> >>>>>>>>> ```
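The comparisons above can be reproduced with a small Java program. This is only a sketch for checking the behavior; the class name `NaNComparisonDemo` and the helper `allComparisonsFalse` are made up for illustration. It also contrasts the primitive operators with `Double.compare`, which does assign NaN a defined position (above every other value).

```java
// Illustrative only: the class and method names are hypothetical.
final class NaNComparisonDemo {
    private NaNComparisonDemo() {}

    // True when none of <, <=, >, >=, == holds between a and b,
    // which happens whenever at least one operand is NaN.
    static boolean allComparisonsFalse(double a, double b) {
        return !(a < b) && !(a <= b) && !(a > b) && !(a >= b) && !(a == b);
    }

    public static void main(String[] args) {
        double nan = Math.log(-1.0); // the logarithm of a negative number is NaN

        // The three comparisons quoted in the message.
        System.out.println(nan >= 2);
        System.out.println(nan < 2);
        System.out.println(2 > nan);

        // Every primitive comparison involving NaN is false.
        System.out.println(allComparisonsFalse(nan, 2.0));

        // Double.compare uses a different, well-defined ordering for NaN.
        System.out.println(Double.compare(nan, 2.0));
    }
}
```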
> >>>>>>>>>
> >>>>>>>>> As a result, Iceberg doesn't trust NaN values as either lower or
> >>>>>>>>> upper bounds because we don't want to go back to the code that
> >>>>>>>>> produced the value to see what the comparison order was to determine
> >>>>>>>>> whether NaN values go before or after others.
> >>>>>>>>>
> >>>>>>>>> Total order solves the second issue in theory, but regular
> >>>>>>>>> comparison is prevalent and not obvious to developers. And it also
> >>>>>>>>> doesn't help when NaN is used instead of null. So using total order
> >>>>>>>>> is not sufficient for data skipping.
> >>>>>>>>>
> >>>>>>>>> I think the right compromise is to use `min`, `max`, and `nan_count`
> >>>>>>>>> for data skipping stats (where min and max cannot be NaN) and total
> >>>>>>>>> ordering for sorting values. That satisfies the data skipping use
> >>>>>>>>> cases and also gives us an ordering of unaltered values that we can
> >>>>>>>>> reason about.
> >>>>>>>>>
> >>>>>>>>> Does anyone think that doesn't work?
> >>>>>>>>>
> >>>>>>>>> Ryan
> >>>>>>>>>
> >>>>>>>>> On Fri, Aug 1, 2025 at 8:57 AM Gang Wu <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Thanks Jan for your endless effort on this!
> >>>>>>>>>>
> >>>>>>>>>> I'm in favor of simplicity and generality. I think we have already
> >>>>>>>>>> debated `nan_count` a lot in [1], and [2] is the reflection of
> >>>>>>>>>> those discussions. Therefore I am inclined to start a vote for [2]
> >>>>>>>>>> unless there is a significantly better proposal.
> >>>>>>>>>>
> >>>>>>>>>> I would suggest that everyone interested in this discussion attend
> >>>>>>>>>> the scheduled sync on Aug 6th (detailed below) to spread the word
> >>>>>>>>>> to the broader community. If we can get a consensus on [2], I can
> >>>>>>>>>> help start the vote and move forward.
> >>>>>>>>>>
> >>>>>>>>>> *Apache Parquet Community Sync Wednesday, August 6 · 10:00 – 11:00am*
> >>>>>>>>>> *Time zone: America/Los_Angeles*
> >>>>>>>>>> *Google Meet joining info Video call link:
> >>>>>>>>>> https://meet.google.com/bhe-rvan-qjk*
> >>>>>>>>>>
> >>>>>>>>>> [1] https://github.com/apache/parquet-format/pull/196
> >>>>>>>>>> [2] https://github.com/apache/parquet-format/pull/221
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Gang
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Aug 1, 2025 at 6:16 PM Jan Finis <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Gijs,
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you for bringing up concrete points, I'm happy to discuss
> >>>>>>>>>>> them in detail.
> >>>>>>>>>>>
> >>>>>>>>>>>> NaNs are less common in the SQL world than in the DataFrame world
> >>>>>>>>>>>> where NaNs were used for a long time to represent missing values.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> You could transcode between NULL and NaN when reading from and
> >>>>>>>>>>> writing to Parquet. You basically mention yourself that NaNs were
> >>>>>>>>>>> used for missing values, i.e., what is commonly a NULL, which
> >>>>>>>>>>> wasn't available. So, semantically, transcoding to NULL would even
> >>>>>>>>>>> be the sane thing to do. Yes, that will cost you some cycles, but
> >>>>>>>>>>> should be a rather lightweight operation in comparison to most
> >>>>>>>>>>> other operations, so I would argue that it won't totally ruin your
> >>>>>>>>>>> performance. Similarly, why should Parquet play along with a
> >>>>>>>>>>> "hack" that was done in other frameworks due to shortcomings of
> >>>>>>>>>>> those frameworks? So from a philosophical point of view, I think
> >>>>>>>>>>> supporting NaNs better is the wrong thing to do. Rather, we should
> >>>>>>>>>>> be a forcing function to align others to better behavior, so
> >>>>>>>>>>> applying a bit of force might in the long run make people use
> >>>>>>>>>>> NULLs also in DataFrames.
> >>>>>>>>>>>
> >>>>>>>>>>> Of course, your argument also goes into the direction of
> >>>>>>>>>>> pragmatism: If a large part of the data science world uses NaNs to
> >>>>>>>>>>> encode missing values, then maybe Parquet should accept this
> >>>>>>>>>>> de-facto standard rather than fighting it. That is indeed a valid
> >>>>>>>>>>> point. The weight of it is debatable and my personal conclusion is
> >>>>>>>>>>> that it's still not worth it, as you can transcode between NULLs
> >>>>>>>>>>> and NaNs, but I do agree with its validity.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> Since the proposal phrases it as a goal to work "regardless of
> >>>>>>>>>>>> how they order NaN w.r.t. other values" this statement feels
> >>>>>>>>>>>> out-of-place to me. Most hardware and most people don't care
> >>>>>>>>>>>> about total ordering and needing to take it into account while
> >>>>>>>>>>>> filtering using statistics seems like preferring the special case
> >>>>>>>>>>>> instead of the common case. Almost no one filters for specific
> >>>>>>>>>>>> NaN value bit-patterns. SQL engines that don't have IEEE total
> >>>>>>>>>>>> ordering as their default ordering for floats will also need to
> >>>>>>>>>>>> do more special handling for this.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I disagree with the conclusion this statement draws. The current
> >>>>>>>>>>> behavior, and nan_counts without total ordering, pose a real
> >>>>>>>>>>> problem here, even for engines that don't care about bit patterns.
> >>>>>>>>>>> I do agree that most database engines, including the one I'm
> >>>>>>>>>>> working on, do not care about bit patterns and/or sign bits.
> >>>>>>>>>>> However, how can our database engine know whether the writer of a
> >>>>>>>>>>> Parquet file saw it the same way? It can't. Therefore, it cannot
> >>>>>>>>>>> know whether a writer, for example, ordered NaNs before or after
> >>>>>>>>>>> all other numbers, or maybe ordered them by sign bit. So, if our
> >>>>>>>>>>> database engine now sees a float column in sorting columns, it
> >>>>>>>>>>> cannot apply any optimization without a lot of special casing, as
> >>>>>>>>>>> it doesn't know whether NaNs will be before all other values,
> >>>>>>>>>>> after all other values, or maybe both, depending on sign bit. It
> >>>>>>>>>>> could apply contrived logic that tries to infer where NaNs were
> >>>>>>>>>>> placed from the NaN counts of the first and last page, but doing
> >>>>>>>>>>> so will be a lot of ugly code that also feels to be in the wrong
> >>>>>>>>>>> place. I.e., I don't want to need to load pages or the page index,
> >>>>>>>>>>> just to reason about a sort order.
> >>>>>>>>>>>
> >>>>>>>>>>>> SQL engines that don't have IEEE total ordering as their default
> >>>>>>>>>>>> ordering for floats will also need to do more special handling
> >>>>>>>>>>>> for this.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> This code, which I would indeed need to write for our engine, is
> >>>>>>>>>>> comparatively trivial. Simply choose the largest possible bit
> >>>>>>>>>>> pattern as comparison for upper bounds filtering for NaN, and the
> >>>>>>>>>>> smallest possible bit pattern for lower bounds. It's not more than
> >>>>>>>>>>> a few lines of code that check whether a filter is NaN and then
> >>>>>>>>>>> replace its value with the highest/lowest NaN bit pattern. It is
> >>>>>>>>>>> similarly trivial to the special casing I need to do with
> >>>>>>>>>>> nan_counts, and it is way more trivial than the extra code I would
> >>>>>>>>>>> need to write for sorting columns, as depicted above.
> >>>>>>>>>>>
> >>>>>>>>>>>> From a Polars perspective, having a `nan_count` and defining what
> >>>>>>>>>>>> happens to the `min` and `max` statistics when a page contains
> >>>>>>>>>>>> only NaNs is enough to allow for all predicate filtering. I
> >>>>>>>>>>>> think, but correct me if I am wrong, this is also enough for all
> >>>>>>>>>>>> SQL engines that don't use total ordering.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> It's not fully enough, as depicted above. Sorting columns would
> >>>>>>>>>>> still not work properly.
> >>>>>>>>>>>
> >>>>>>>>>>>> As for ways forward, I propose merging the `nan_count` and `sort
> >>>>>>>>>>>> ordering` proposals into one to make one proposal
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Note that the initial reason for proposing IEEE total order was
> >>>>>>>>>>> that people in the discussion threads found nan_counts to be too
> >>>>>>>>>>> complex and too much of an undeserving special case (re-read the
> >>>>>>>>>>> discussion in the initial PR
> >>>>>>>>>>> <https://github.com/apache/parquet-format/pull/196> to see the
> >>>>>>>>>>> rationales).
> >>>>>>>>>>> So merging both together would go totally against the spirit of
> >>>>>>>>>>> why IEEE total order was proposed. While it has further upsides,
> >>>>>>>>>>> the main reason was indeed to *not have* nan_counts. If now the
> >>>>>>>>>>> proposal would even go to positive and negative nan counts (i.e.,
> >>>>>>>>>>> even more complexity), this would go 180 degrees into the opposite
> >>>>>>>>>>> direction of why people wanted total order in the first place.
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>> Jan
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Jul 31, 2025 at 23:23, Gijs Burghoorn <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hello Jan and others,
> >>>>>>>>>>>>
> >>>>>>>>>>>> First, let me preface by saying I am quite new here. So I
> >>>>>>>>>>>> apologize if there is some other better way to bring up these
> >>>>>>>>>>>> concerns. I understand it is very annoying to come in at the 11th
> >>>>>>>>>>>> hour and start bringing up a bunch of concerns, but I would also
> >>>>>>>>>>>> like this to be done right. A colleague of mine brought up some
> >>>>>>>>>>>> concerns and alternative approaches in the GitHub thread; I will
> >>>>>>>>>>>> file some of the concerns here as a response.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Treating NaNs so specially is giving them attention they don't
> >>>>>>>>>>>>> deserve. Most data sets do not contain NaNs. If a use case
> >>>>>>>>>>>>> really requires them and needs filtering to ignore them, they
> >>>>>>>>>>>>> can store NULL instead, or encode them differently. I would
> >>>>>>>>>>>>> prefer the average case over the special case here.
> >>>>>>>>>>>>
> >>>>>>>>>>>> NaNs are less common in the SQL world than in the DataFrame world
> >>>>>>>>>>>> where NaNs were used for a long time to represent missing values.
> >>>>>>>>>>>> They still exist with different canonical representations and
> >>>>>>>>>>>> different sign bits. I agree it might not be correct
> >>>>>>>>>>>> semantically, but sadly that is the world we deal with. NumPy and
> >>>>>>>>>>>> Numba do not have missing data functionality, people use NaNs
> >>>>>>>>>>>> there, and people definitely use that in their analytical
> >>>>>>>>>>>> dataflows. Another point that was brought up in the GH discussion
> >>>>>>>>>>>> was "what about infinity? You could argue that having infinity in
> >>>>>>>>>>>> statistics is similarly unuseful as it's too wide of a bound". I
> >>>>>>>>>>>> would argue that infinity is very different as there is no
> >>>>>>>>>>>> discussion on what the ordering or pattern of infinity is.
> >>>>>>>>>>>> Everyone agrees that `min(1.0, inf, -inf) == -inf` and each
> >>>>>>>>>>>> infinity only has a single bit pattern.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> It gives a defined order to every bit pattern and thus yields a
> >>>>>>>>>>>>> total order, mathematically speaking, which has value by itself.
> >>>>>>>>>>>>> With NaN counts, it was still undefined how different bit
> >>>>>>>>>>>>> patterns of NaNs were supposed to be ordered, whether NaN was
> >>>>>>>>>>>>> allowed to have a sign bit, etc., risking that different engines
> >>>>>>>>>>>>> could come to different results while filtering or sorting
> >>>>>>>>>>>>> values within a file.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Since the proposal phrases it as a goal to work "regardless of
> >>>>>>>>>>>> how they order NaN w.r.t. other values" this statement feels
> >>>>>>>>>>>> out-of-place to me. Most hardware and most people don't care
> >>>>>>>>>>>> about total ordering and needing to take it into account while
> >>>>>>>>>>>> filtering using statistics seems like preferring the special case
> >>>>>>>>>>>> instead of the common case. Almost no one filters for specific
> >>>>>>>>>>>> NaN value bit-patterns. SQL engines that don't have IEEE total
> >>>>>>>>>>>> ordering as their default ordering for floats will also need to
> >>>>>>>>>>>> do more special handling for this.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I also agree with my colleague that adopting an approach that is
> >>>>>>>>>>>> 50% of the way there will make the barrier to later improving it
> >>>>>>>>>>>> to what it actually should be much higher.
> >>>>>>>>>>>>
> >>>>>>>>>>>> As for ways forward, I propose merging the `nan_count` and `sort
> >>>>>>>>>>>> ordering` proposals into a single proposal, as they are linked
> >>>>>>>>>>>> together, and moving forward with one without knowing what will
> >>>>>>>>>>>> happen to the other seems unwise. From a Polars perspective,
> >>>>>>>>>>>> having a `nan_count` and defining what happens to the `min` and
> >>>>>>>>>>>> `max` statistics when a page contains only NaNs is enough to
> >>>>>>>>>>>> allow for all predicate filtering. I think, but correct me if I
> >>>>>>>>>>>> am wrong, this is also enough for all SQL engines that don't use
> >>>>>>>>>>>> total ordering. But if you want to be impartial to the engine's
> >>>>>>>>>>>> floating-point ordering and allow engines with total ordering to
> >>>>>>>>>>>> do inequality filters when `nan_count > 0` you would need a
> >>>>>>>>>>>> `positive_nan_count` and a `negative_nan_count`. I understand the
> >>>>>>>>>>>> downside with Thrift complexity, but introducing another sort
> >>>>>>>>>>>> order is also adding complexity just in a different place.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I would really like to see this move forward, so I hope these
> >>>>>>>>>>>> concerns help move it forward towards a solution that works for
> >>>>>>>>>>>> everyone.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>> Gijs
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Jul 31, 2025 at 7:46 PM Andrew Lamb <[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I would also be in favor of starting a vote
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Jul 31, 2025 at 11:23 AM Jan Finis <[email protected]> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> As the author of both the IEEE754 total order
> >>>>>>>>>>>>>> <https://github.com/apache/parquet-format/pull/221> PR and the
> >>>>>>>>>>>>>> earlier PR that basically proposed `nan_count`
> >>>>>>>>>>>>>> <https://github.com/apache/parquet-format/pull/196>, my current
> >>>>>>>>>>>>>> vote would be for IEEE754 total order.
> >>>>>>>>>>>>>> Consequently, I would like to request a formal vote for the PR
> >>>>>>>>>>>>>> introducing IEEE754 total order
> >>>>>>>>>>>>>> (https://github.com/apache/parquet-format/pull/221), if that is
> >>>>>>>>>>>>>> possible.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> My Rationales:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>  - It's conceptually simpler. It's easier to explain. It's
> >>>>>>>>>>>>>>    based on an IEEE-standardized order predicate.
> >>>>>>>>>>>>>>  - There are already multiple implementations showing
> >>>>>>>>>>>>>>    feasibility. This will likely make the adoption quicker.
> >>>>>>>>>>>>>>  - It gives a defined order to every bit pattern and thus
> >>>>>>>>>>>>>>    yields a total order, mathematically speaking, which has
> >>>>>>>>>>>>>>    value by itself. With NaN counts, it was still undefined how
> >>>>>>>>>>>>>>    different bit patterns of NaNs were supposed to be ordered,
> >>>>>>>>>>>>>>    whether NaN was allowed to have a sign bit, etc., risking
> >>>>>>>>>>>>>>    that different engines could come to different results while
> >>>>>>>>>>>>>>    filtering or sorting values within a file.
> >>>>>>>>>>>>>>  - It also solves sort order completely. With nan_counts only,
> >>>>>>>>>>>>>>    it is still undefined whether NaNs should be sorted before
> >>>>>>>>>>>>>>    or after all values (or both, depending on sign bit), so any
> >>>>>>>>>>>>>>    file including NaNs could not really leverage sort order
> >>>>>>>>>>>>>>    without being ambiguous.
> >>>>>>>>>>>>>>  - It's less complex in Thrift. Having fields that only apply
> >>>>>>>>>>>>>>    to a handful of data types is somewhat odd. If every type
> >>>>>>>>>>>>>>    did this, we would have a plethora of non-generic fields in
> >>>>>>>>>>>>>>    Thrift.
> >>>>>>>>>>>>>>  - Treating NaNs so specially is giving them attention they
> >>>>>>>>>>>>>>    don't deserve. Most data sets do not contain NaNs. If a use
> >>>>>>>>>>>>>>    case really requires them and needs filtering to ignore
> >>>>>>>>>>>>>>    them, they can store NULL instead, or encode them
> >>>>>>>>>>>>>>    differently. I would prefer the average case over the
> >>>>>>>>>>>>>>    special case here.
> >>>>>>>>>>>>>>  - The majority of the people discussing this so far seem to
> >>>>>>>>>>>>>>    favor total order.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>> Jan
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Sat, Jul 26, 2025 at 17:38, Gang Wu <[email protected]> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> As this discussion has been open for more than two years, I’d
> >>>>>>>>>>>>>>> like to bump up this thread again to update the progress and
> >>>>>>>>>>>>>>> collect feedback.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *Background*
> >>>>>>>>>>>>>>> • Today Parquet’s min/max stats and page index omit NaNs
> >>>>>>>>>>>>>>>   entirely.
> >>>>>>>>>>>>>>> • Engines can’t safely prune floating-point values because
> >>>>>>>>>>>>>>>   they know nothing about NaNs.
> >>>>>>>>>>>>>>> • Column index is disabled if any page contains only NaNs.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> There are two active proposals as below:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *Proposal A - IEEE754TotalOrder* (from the PR [1])
> >>>>>>>>>>>>>>> • Define a new ColumnOrder to include +0, –0 and all NaN
> >>>>>>>>>>>>>>>   bit‐patterns.
> >>>>>>>>>>>>>>> • Stats and column index store NaNs if they appear.
> >>>>>>>>>>>>>>> • Three PoC impls are ready: arrow-rs [2], duckdb [3] and
> >>>>>>>>>>>>>>>   parquet-java [4].
> >>>>>>>>>>>>>>> • For more context on this approach, please refer to the
> >>>>>>>>>>>>>>>   discussion in [5].
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *Proposal B - add nan_count* (from a comment [6] to [1])
> >>>>>>>>>>>>>>> • Add `nan_count` to stats and a `nan_counts` list to column
> >>>>>>>>>>>>>>>   index.
> >>>>>>>>>>>>>>> • For all‐NaNs cases, write NaN to min/max and use nan_count to
> >>>>>>>>>>>>>>>   distinguish.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Both solutions have pros and cons but are way better than the
> >>>>>>>>>>>>>>> status quo today.
> >>>>>>>>>>>>>>> Please share your thoughts on the two proposals above, or maybe
> >>>>>>>>>>>>>>> come up with better alternatives. We need to reach consensus on
> >>>>>>>>>>>>>>> one proposal and move forward.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> [1] https://github.com/apache/parquet-format/pull/221
> >>>>>>>>>>>>>>> [2] https://github.com/apache/arrow-rs/pull/7408
> >>>>>>>>>>>>>>> [3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
> >>>>>>>>>>>>>>> [4] https://github.com/apache/parquet-java/pull/3191
> >>>>>>>>>>>>>>> [5] https://github.com/apache/parquet-format/pull/196
> >>>>>>>>>>>>>>> [6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>> Gang
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <[email protected]> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Dear contributors,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> My PR has now gathered comments for a week and the gist of
> >>>>>>>>>>>>>>>> all open issues is the question of how to encode pages/column
> >>>>>>>>>>>>>>>> chunks that contain only NaNs. There are different
> >>>>>>>>>>>>>>>> suggestions and I don't see one common favorite yet.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I have outlined three alternatives of how we can handle these
> >>>>>>>>>>>>>>>> and I want us to reach a conclusion here, so I can update my
> >>>>>>>>>>>>>>>> PR accordingly and move on with it. As this is my first
> >>>>>>>>>>>>>>>> contribution to parquet, I don't know the decision processes
> >>>>>>>>>>>>>>>> here. Do we vote? Is there a single decision maker or a group
> >>>>>>>>>>>>>>>> of decision makers? *Please let me know how to come to a
> >>>>>>>>>>>>>>>> conclusion here; what are the next steps?*
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> For reference, here are the three alternatives I pointed out.
> >>>>>>>>>>>>>>>> You can find a detailed description of their pros and cons in
> >>>>>>>>>>>>>>>> my comment:
> >>>>>>>>>>>>>>>> https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 1. My initial proposal, i.e., encoding only-NaN pages by
> >>>>>>>>>>>>>>>>    min=max=NaN.
> >>>>>>>>>>>>>>>> 2. Adding `num_values` to the ColumnIndex, to make it
> >>>>>>>>>>>>>>>>    symmetric with Statistics in pages & `ColumnMetaData` and
> >>>>>>>>>>>>>>>>    to enable the computation
> >>>>>>>>>>>>>>>>    `num_values - null_count - nan_count == 0`
> >>>>>>>>>>>>>>>> 3. Adding a `nan_pages` bool list to the column index, which
> >>>>>>>>>>>>>>>>    indicates whether a page contains only NaNs
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Cheers
> >>>>>>>>>>>>>>>> Jan Finis
> >>>>>>>>>>>>>>>>
>
>
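
For readers following the quoted thread: the argument that IEEE 754 total order "gives a defined order to every bit pattern" can be illustrated with a small Python sketch. The function name `total_order_key` is hypothetical and is not taken from any of the linked PRs. It uses the common bit trick of reinterpreting a double as a signed 64-bit integer and flipping the low 63 bits of negative values, so that plain integer comparison gives the order -NaN < -inf < ... < -0 < +0 < ... < +inf < +NaN.

```python
import math
import struct


def total_order_key(x: float) -> int:
    """Return an integer whose ordering follows IEEE 754 total order for doubles.

    Hypothetical helper for illustration only, not code from the linked PRs.
    """
    # Reinterpret the 8 bytes of the double as a signed 64-bit integer.
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    # Negative floats have the sign bit set. Flip their low 63 bits so that a
    # larger magnitude maps to a smaller (more negative) integer.
    if bits < 0:
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits


neg_nan = math.copysign(float("nan"), -1.0)
pos_nan = math.copysign(float("nan"), 1.0)
values = [pos_nan, 1.0, -0.0, float("-inf"), 0.0, neg_nan, float("inf"), -1.0]

# Sorting by the key places every value, NaNs and signed zeros included.
ordered = sorted(values, key=total_order_key)
```

With such a key, min/max statistics stay well defined even when a page contains NaNs, which is the property Proposal A relies on.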
>
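
The check `num_values - null_count - nan_count == 0` from alternative 2 in the oldest quoted message can be sketched the same way. This is a minimal illustration: the `PageCounts` record and the `is_only_nan` function are hypothetical names, and the guard that excludes all-null pages is an added assumption, since the bare formula is also true for a page that holds nothing but nulls.

```python
from dataclasses import dataclass


@dataclass
class PageCounts:
    """Hypothetical per-page counters, named after the fields in the thread."""
    num_values: int  # all values in the page, nulls included (assumption)
    null_count: int
    nan_count: int


def is_only_nan(page: PageCounts) -> bool:
    """True if every non-null value in the page is NaN."""
    non_null = page.num_values - page.null_count
    # Assumption: a page with no non-null values is not an only-NaN page.
    return non_null > 0 and non_null - page.nan_count == 0


pages = [
    PageCounts(num_values=10, null_count=2, nan_count=8),   # only NaNs and nulls
    PageCounts(num_values=10, null_count=2, nan_count=3),   # mixed values
    PageCounts(num_values=10, null_count=10, nan_count=0),  # only nulls
]
only_nan_flags = [is_only_nan(p) for p in pages]
```

A reader could use a flag like this to tell whether min/max for a page carry no ordinary values, without a dedicated `nan_pages` list.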
