Re: [DISCUSS](PARQUET-2249) Add nan_count to handle NaNs in statistics

Jan Finis Wed, 13 Aug 2025 02:40:00 -0700

I have now tagged
<https://github.com/apache/parquet-format/pull/514#issuecomment-3182978173>
the people that argued for total order in the initial PR. Let's see their
response.


If I understand the adoption process correctly, the next hurdle to getting
this adopted is two open source (!) implementations proving its feasibility.
We already had that for IEEE total order. If we prefer the solution with nan
counts, we'll need it there as well. I myself work on a proprietary
implementation, so I'm counting on others here :). Be prepared though, this
will likely take months unless the interest in this topic has risen to a
point where people are eager to jump on the implementation right away.

So, I guess it will take some months of soaking time before any formal vote
can be held (provided that we reach consensus that this is what we want and
we find people for the implementations).

Cheers,
Jan

On Wed, Aug 13, 2025 at 01:18, Ryan Blue <rdb...@gmail.com> wrote:

> Thanks, Jan. I also went through the combined proposal and it looks mostly
> good to me.
>
> > First of all, to make it quick: Yes, the solution of having nan_counts
> > *and* total order, which was brought up multiple times, does work and
> > solves more cases than either one alone.
>
> Great, then we have a solution for both filtering use cases and for moving
> ahead with total order. And thanks to Andrew for suggesting this as well on
> the second PR. I think this also looks like something that Orson is
> okay with given his comments on the latest PR.
>
> Is there anyone against the combined approach? I don't see a big downside
> for anyone. It is compatible with previous stats rules, has a NaN count,
> and allows using either type-specific order or total order.
>
> Assuming that this satisfies the big objections, I think we should wait for
> a few days to make sure everyone has time to check out the new PR and then
> vote to adopt it.
>
> Ryan
>
> On Mon, Aug 11, 2025 at 6:03 AM Andrew Lamb <andrewlam...@gmail.com>
> wrote:
>
> > Thank you Jan -- I read through the new combined proposal, and I think it
> > looks good and addresses the feedback so far. I left some small style
> > suggestions, but nothing that is required from my perspective.
> >
> >
> >
> > On Sat, Aug 9, 2025 at 9:07 AM Jan Finis <jpfi...@gmail.com> wrote:
> >
> > > Hey Ryan,
> > >
> > > > Thanks for chiming in. First of all, to make it quick: Yes, the
> > > > solution of having nan_counts *and* total order, which was brought up
> > > > multiple times, does work and solves more cases than either one alone.
> > >
> > > > > I strongly prefer continuing to discuss the merits of these
> > > > > approaches rather than trying to decide with a vote.
> > >
> > >
> > > > In theory, I agree that it isn't good to silence a discussion by just
> > > > voting for one possible solution and technical issues should be
> > > > discussed. However, please note that we have been circling on this for
> > > > over two years now, including an extended discussion that brought up
> > > > all arguments multiple times. This is in stark contrast to the speed
> > > > with which you guys work on the Iceberg spec, for example. There, you
> > > > also do not discuss the merits of various solutions for multiple years.
> > > > You just pick one and merge it after a *reasonable* time of discussion.
> > > > If you had the speed we currently have here, nothing would get done.
> > > > Thus, I see this as a clear case of *"the perfect is the enemy of the
> > > > good"*. Yes, we can continue looking for the perfect solution, but that
> > > > will likely lead to keeping us at the status quo, which is the worst of
> > > > them all.
> > >
> > > > That being said, I'm also happy to create a PR which does both total
> > > > order and NaN counts; after all, I just want the issue solved and all
> > > > these solutions are better than the status quo.
> > >
> > > > *As this has now been suggested by at least three people, I guess it's
> > > > worth doing, so here you go:
> > > > https://github.com/apache/parquet-format/pull/514*
> > >
> > > > With this, we should have PRs covering most of the solution space.
> > > > (I'm refusing to create a PR with negative and positive nan_counts;
> > > > nan_counts + total order has to suffice; the complexity madness has to
> > > > stop somewhere.)
> > > > I still believe that there were a number of people who already found
> > > > nan_counts too complex and therefore wanted IEEE total order, and these
> > > > people may not like adding extra complexity, but let's see, maybe some
> > > > have also changed their opinion in the meantime.
> > >
> > >
> > > > *Given all this, we can also first do an informal vote where everyone
> > > > can vote for which of the three would be their favorite. Maybe a clear
> > > > favorite will emerge and then we can vote on this one.*
> > >
> > > > But of course, we can also take some weeks to discuss the three
> > > > solutions, now that we have PRs for all of them. I just hope this won't
> > > > make us continue for another 2 years, or end up in an infinite
> > > > stalemate where each solution is vetoed by a PMC member.
> > > > (Sorry for becoming a bit cynical here; I have just spent way too much
> > > > time of my life with double statistics at this point ;) ...)
> > >
> > >
> > > Cheers,
> > > Jan
> > >
> > > > On Fri, Aug 8, 2025 at 23:38, Ryan Blue <rdb...@gmail.com> wrote:
> > >
> > > > > Regarding the process for this, I strongly prefer continuing to
> > > > > discuss the merits of these approaches rather than trying to decide
> > > > > with a vote. I don't think it is a good practice to use a vote to
> > > > > decide on a technical direction. There are very few situations that
> > > > > warrant it and I don't think that this is one of them. While this
> > > > > issue has been open for a long time, that appears to be the result of
> > > > > it not being anyone's top priority rather than indecision.
> > > >
> > > > > For the technical merits of these approaches, I think that we can
> > > > > find a middle ground. I agree with Jan that when working with sorted
> > > > > values, we need to know how NaN values were handled and that requires
> > > > > using a well-defined order that includes NaN and its variations
> > > > > (because we should not normalize). Using NaN count is not sufficient
> > > > > for ordering rows.
> > > >
> > > > > Gijs also brings up good points about how NaN values show up in
> > > > > actual datasets: not just when used in place of null, but also as the
> > > > > result of normal calculations on abnormal data, like `sqrt(-4.0)` or
> > > > > `log(-1.0)`. Both of those present problems when mixed with valid
> > > > > data because of the stats "poisoning" problem, where the range of
> > > > > valid data is usable until a single NaN is mixed in.
> > > >
> > > > > Another issue is that NaN is error-prone because "regular" comparison
> > > > > is always false:
> > > > ```
> > > > Math.log(-1.0) >= 2 => FALSE
> > > > Math.log(-1.0) < 2 => FALSE
> > > > 2 > Math.log(-1.0) => FALSE
> > > > ```
> > > >
> > > > > As a result, Iceberg doesn't trust NaN values as either lower or
> > > > > upper bounds because we don't want to go back to the code that
> > > > > produced the value to see what the comparison order was to determine
> > > > > whether NaN values go before or after others.
> > > >
> > > > > Total order solves the second issue in theory, but regular comparison
> > > > > is prevalent and not obvious to developers. And it also doesn't help
> > > > > when NaN is used instead of null. So using total order is not
> > > > > sufficient for data skipping.
> > > >
> > > > > I think the right compromise is to use `min`, `max`, and `nan_count`
> > > > > for data skipping stats (where min and max cannot be NaN) and total
> > > > > ordering for sorting values. That satisfies the data skipping use
> > > > > cases and also gives us an ordering of unaltered values that we can
> > > > > reason about.
> > > >
> > > > Does anyone think that doesn't work?
> > > >
> > > > Ryan
> > > >
> > > > On Fri, Aug 1, 2025 at 8:57 AM Gang Wu <ust...@gmail.com> wrote:
> > > >
> > > > > Thanks Jan for your endless effort on this!
> > > > >
> > > > > > I'm in favor of simplicity and generality. I think we have already
> > > > > > debated a lot for `nan_count` in [1], and [2] is the reflection of
> > > > > > those discussions. Therefore I am inclined to start a vote for [2]
> > > > > > unless there is a significantly better proposal.
> > > > >
> > > > > > I would suggest that everyone interested in this discussion attend
> > > > > > the scheduled sync on Aug 6th (detailed below) to spread the word to
> > > > > > the broader community. If we can get a consensus on [2], I can help
> > > > > > start the vote and move forward.
> > > > >
> > > > > > *Apache Parquet Community Sync Wednesday, August 6 · 10:00 – 11:00am*
> > > > > > *Time zone: America/Los_Angeles*
> > > > > > *Google Meet joining info Video call link:
> > > > > > https://meet.google.com/bhe-rvan-qjk*
> > > > >
> > > > > [1] https://github.com/apache/parquet-format/pull/196
> > > > > [2] https://github.com/apache/parquet-format/pull/221
> > > > >
> > > > > Best,
> > > > > Gang
> > > > >
> > > > >
> > > > > > On Fri, Aug 1, 2025 at 6:16 PM Jan Finis <jpfi...@gmail.com> wrote:
> > > > >
> > > > > > Hi Gijs,
> > > > > >
> > > > > > > Thank you for bringing up concrete points; I'm happy to discuss
> > > > > > > them in detail.
> > > > > >
> > > > > > > > NaNs are less common in the SQL world than in the DataFrame
> > > > > > > > world where NaNs were used for a long time to represent missing
> > > > > > > > values.
> > > > > >
> > > > > >
> > > > > > > You could transcode between NULL and NaN before reading and
> > > > > > > writing to Parquet. You basically mention yourself that NaNs
> > > > > > > were used for missing values, i.e., what is commonly a NULL,
> > > > > > > which wasn't available. So, semantically, transcoding to NULL
> > > > > > > would even be the sane thing to do. Yes, that will cost you
> > > > > > > some cycles, but should be a rather lightweight operation in
> > > > > > > comparison to most other operations, so I would argue that it
> > > > > > > won't totally ruin your performance. Similarly, why should
> > > > > > > Parquet play along with a "hack" that was done in other
> > > > > > > frameworks due to shortcomings of those frameworks? So from a
> > > > > > > philosophical point of view, I think supporting NaNs better is
> > > > > > > the wrong thing to do. Rather, we should be a forcing function
> > > > > > > to align others to better behavior, so applying a bit of force
> > > > > > > might in the long run make people use NULLs also in DataFrames.
> > > > > >
> > > > > > > Of course, your argument also goes into the direction of
> > > > > > > pragmatism: If a large part of the data science world uses NaNs
> > > > > > > to encode missing values, then maybe Parquet should accept this
> > > > > > > de-facto standard rather than fighting it. That is indeed a valid
> > > > > > > point. The weight of it is debatable and my personal conclusion
> > > > > > > is that it's still not worth it, as you can transcode between
> > > > > > > NULLs and NaNs, but I do agree with its validity.
> > > > > >
> > > > > >
> > > > > > > > Since the proposal phrases it as a goal to work "regardless of
> > > > > > > > how they order NaN w.r.t. other values" this statement feels
> > > > > > > > out-of-place to me. Most hardware and most people don't care
> > > > > > > > about total ordering and needing to take it into account while
> > > > > > > > filtering using statistics seems like preferring the special
> > > > > > > > case instead of the common case. Almost no one filters for
> > > > > > > > specific NaN value bit-patterns. SQL engines that don't have
> > > > > > > > IEEE total ordering as their default ordering for floats will
> > > > > > > > also need to do more special handling for this.
> > > > > >
> > > > > >
> > > > > > > I disagree with the conclusion this statement draws. The current
> > > > > > > behavior, and nan_counts without total ordering, pose a real
> > > > > > > problem here, even for engines that don't care about bit
> > > > > > > patterns. I do agree that most database engines, including the
> > > > > > > one I'm working on, do not care about bit patterns and/or sign
> > > > > > > bits. However, how can our database engine know whether the
> > > > > > > writer of a Parquet file saw it the same way? It can't.
> > > > > > > Therefore, it cannot know whether a writer, for example, ordered
> > > > > > > NaNs before or after all other numbers, or maybe ordered them by
> > > > > > > sign bit. So, if our database engine now sees a float column in
> > > > > > > sorting columns, it cannot apply any optimization without a lot
> > > > > > > of special casing, as it doesn't know whether NaNs will be before
> > > > > > > all other values, after all other values, or maybe both,
> > > > > > > depending on sign bit. It could apply contrived logic that tries
> > > > > > > to infer where NaNs were placed from the NaN counts of the first
> > > > > > > and last page, but doing so will be a lot of ugly code that also
> > > > > > > feels to be in the wrong place. I.e., I don't want to need to
> > > > > > > load pages or the page index, just to reason about a sort order.
> > > > > >
> > > > > > > > SQL engines that don't have IEEE total ordering as their default
> > > > > > > > ordering for floats will also need to do more special handling
> > > > > > > > for this.
> > > > > >
> > > > > >
> > > > > > > This code, which I would indeed need to write for our engine, is
> > > > > > > comparatively trivial. Simply choose the largest possible bit
> > > > > > > pattern as comparison for upper bounds filtering for NaN, and the
> > > > > > > smallest possible bit pattern for lower bounds. It's not more
> > > > > > > than a few lines of code that check whether a filter is NaN and
> > > > > > > then replace its value with the highest/lowest NaN bit pattern.
> > > > > > > It is similarly trivial to the special casing I need to do with
> > > > > > > nan_counts, and it is way more trivial than the extra code I
> > > > > > > would need to write for sorting columns, as depicted above.
> > > > > >
> > > > > > > > From a Polars perspective, having a `nan_count` and defining what
> > > > > > > > happens to the `min` and `max` statistics when a page contains
> > > > > > > > only NaNs is enough to allow for all predicate filtering. I
> > > > > > > > think, but correct me if I am wrong, this is also enough for all
> > > > > > > > SQL engines that don't use total ordering.
> > > > > >
> > > > > >
> > > > > > > It's not fully enough, as depicted above. Sorting columns would
> > > > > > > still not work properly.
> > > > > >
> > > > > > > > As for ways forward, I propose merging the `nan_count` and `sort
> > > > > > > > ordering` proposals into one to make one proposal
> > > > > >
> > > > > >
> > > > > > > Note that the initial reason for proposing IEEE total order was
> > > > > > > that people in the discussion threads found nan_counts to be too
> > > > > > > complex and too much of an undeserving special case (re-read the
> > > > > > > discussion in the initial PR
> > > > > > > <https://github.com/apache/parquet-format/pull/196> to see the
> > > > > > > rationales).
> > > > > > > So merging both together would go totally against the spirit of
> > > > > > > why IEEE total order was proposed. While it has further upsides,
> > > > > > > the main reason was indeed to *not have* nan_counts. If now the
> > > > > > > proposal would even go to positive and negative nan counts (i.e.,
> > > > > > > even more complexity), this would go 180 degrees into the
> > > > > > > opposite direction of why people wanted total order in the first
> > > > > > > place.
> > > > > >
> > > > > > Cheers,
> > > > > > Jan
> > > > > >
> > > > > > > On Thu, Jul 31, 2025 at 23:23, Gijs Burghoorn
> > > > > > > <g...@polars.tech.invalid> wrote:
> > > > > >
> > > > > > > Hello Jan and others,
> > > > > > >
> > > > > > > First, let me preface by saying I am quite new here. So I
> > apologize
> > > > if
> > > > > > > there is some other better way to bring up these concerns. I
> > > > understand
> > > > > > it
> > > > > > > is very annoying to come in at the 11th hour and start bringing
> > up
> > > a
> > > > > > bunch
> > > > > > > of concerns, but I would also like this to be done right. A
> > > colleague
> > > > > of
> > > > > > > mine brought up some concerns and alternative approaches in the
> > > > GitHub
> > > > > > > thread; I will file some of the concerns here as a response.
> > > > > > >
> > > > > > > > Treating NaNs so specially is giving them attention they
> don't
> > > > > deserve.
> > > > > > > Most data sets do not contain NaNs. If a use case really
> requires
> > > > them
> > > > > > and
> > > > > > > needs filtering to ignore them, they can store NULL instead, or
> > > > encode
> > > > > > them
> > > > > > > differently. I would prefer the average case over the special
> > case
> > > > > here.
> > > > > > >
> > > > > > > > NaNs are less common in the SQL world than in the DataFrame world
> > > > > > > > where NaNs were used for a long time to represent missing
> > > > > > > > values. They still exist with different canonical
> > > > > > > > representations and different sign bits. I agree it might not be
> > > > > > > > correct semantically, but sadly that is the world we deal with.
> > > > > > > > NumPy and Numba do not have missing data functionality, people
> > > > > > > > use NaNs there, and people definitely use that in their
> > > > > > > > analytical dataflows. Another point that was brought up in the
> > > > > > > > GH discussion was "what about infinity? You could argue that
> > > > > > > > having infinity in statistics is similarly unuseful as it's too
> > > > > > > > wide of a bound". I would argue that infinity is very different
> > > > > > > > as there is no discussion on what the ordering or pattern of
> > > > > > > > infinity is. Everyone agrees that `min(1.0, inf, -inf) == -inf`
> > > > > > > > and each infinity only has a single bit pattern.
> > > > > > >
> > > > > > > > It gives a defined order to every bit pattern and thus
> yields a
> > > > total
> > > > > > > order, mathematically speaking, which has value by itself. With
> > NaN
> > > > > > counts,
> > > > > > > it was still undefined how different bit patterns of NaNs were
> > > > supposed
> > > > > > to
> > > > > > > be ordered, whether NaN was allowed to have a sign bit, etc.,
> > > risking
> > > > > > that
> > > > > > > different engines could come to different results while
> filtering
> > > or
> > > > > > > sorting values within a file.
> > > > > > >
> > > > > > > > Since the proposal phrases it as a goal to work "regardless of
> > > > > > > > how they order NaN w.r.t. other values" this statement feels
> > > > > > > > out-of-place to me. Most hardware and most people don't care
> > > > > > > > about total ordering and needing to take it into account while
> > > > > > > > filtering using statistics seems like preferring the special
> > > > > > > > case instead of the common case. Almost no one filters for
> > > > > > > > specific NaN value bit-patterns. SQL engines that don't have
> > > > > > > > IEEE total ordering as their default ordering for floats will
> > > > > > > > also need to do more special handling for this.
> > > > > > >
> > > > > > > I also agree with my colleague that doing an approach that is
> 50%
> > > of
> > > > > the
> > > > > > > way there will make the barrier to improving it to what it
> > actually
> > > > > > should
> > > > > > > be later on much higher.
> > > > > > >
> > > > > > > > As for ways forward, I propose merging the `nan_count` and `sort
> > > > > > > > ordering` proposals into one to make one proposal, as they are
> > > > > > > > linked together, and moving forward with one without knowing
> > > > > > > > what will happen to the other seems unwise. From a Polars
> > > > > > > > perspective, having a `nan_count` and defining what happens to
> > > > > > > > the `min` and `max` statistics when a page contains only NaNs is
> > > > > > > > enough to allow for all predicate filtering. I think, but
> > > > > > > > correct me if I am wrong, this is also enough for all SQL
> > > > > > > > engines that don't use total ordering. But if you want to be
> > > > > > > > impartial to the engine's floating-point ordering and allow
> > > > > > > > engines with total ordering to do inequality filters when
> > > > > > > > `nan_count > 0` you would need a `positive_nan_count` and a
> > > > > > > > `negative_nan_count`. I understand the downside with Thrift
> > > > > > > > complexity, but introducing another sort order is also adding
> > > > > > > > complexity just in a different place.
> > > > > > >
> > > > > > > I would really like to see this move forward, so I hope these
> > > > concerns
> > > > > > help
> > > > > > > move it forward towards a solution that works for everyone.
> > > > > > >
> > > > > > > Kind regards,
> > > > > > > Gijs
> > > > > > >
> > > > > > >
> > > > > > > > On Thu, Jul 31, 2025 at 7:46 PM Andrew Lamb <andrewlam...@gmail.com>
> > > > > > > > wrote:
> > > > > > >
> > > > > > > > I would also be in favor of starting a vote
> > > > > > > >
> > > > > > > > > On Thu, Jul 31, 2025 at 11:23 AM Jan Finis <jpfi...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > > As the author of both the IEEE754 total order
> > > > > > > > > > <https://github.com/apache/parquet-format/pull/221> PR and
> > > > > > > > > > the earlier PR that basically proposed `nan_count`
> > > > > > > > > > <https://github.com/apache/parquet-format/pull/196>, my
> > > > > > > > > > current vote would be for IEEE754 total order.
> > > > > > > > > > Consequently, I would like to request a formal vote for the
> > > > > > > > > > PR introducing IEEE754 total order
> > > > > > > > > > (https://github.com/apache/parquet-format/pull/221), if that
> > > > > > > > > > is possible.
> > > > > > > > >
> > > > > > > > > > My rationale:
> > > > > > > > > >
> > > > > > > > > >    - It's conceptually simpler. It's easier to explain. It's
> > > > > > > > > >    based on an IEEE-standardized order predicate.
> > > > > > > > > >    - There are already multiple implementations showing
> > > > > > > > > >    feasibility. This will likely make the adoption quicker.
> > > > > > > > > >    - It gives a defined order to every bit pattern and thus
> > > > > > > > > >    yields a total order, mathematically speaking, which has
> > > > > > > > > >    value by itself. With NaN counts, it was still undefined
> > > > > > > > > >    how different bit patterns of NaNs were supposed to be
> > > > > > > > > >    ordered, whether NaN was allowed to have a sign bit, etc.,
> > > > > > > > > >    risking that different engines could come to different
> > > > > > > > > >    results while filtering or sorting values within a file.
> > > > > > > > > >    - It also solves sort order completely. With nan_counts
> > > > > > > > > >    only, it is still undefined whether NaNs should be sorted
> > > > > > > > > >    before or after all values (or both, depending on sign
> > > > > > > > > >    bit), so any file including NaNs could not really leverage
> > > > > > > > > >    sort order without being ambiguous.
> > > > > > > > > >    - It's less complex in Thrift. Having fields that only
> > > > > > > > > >    apply to a handful of data types is somewhat weird. If
> > > > > > > > > >    every type did this, we would have a plethora of
> > > > > > > > > >    non-generic fields in Thrift.
> > > > > > > > > >    - Treating NaNs so specially is giving them attention they
> > > > > > > > > >    don't deserve. Most data sets do not contain NaNs. If a use
> > > > > > > > > >    case really requires them and needs filtering to ignore
> > > > > > > > > >    them, they can store NULL instead, or encode them
> > > > > > > > > >    differently. I would prefer the average case over the
> > > > > > > > > >    special case here.
> > > > > > > > > >    - The majority of the people discussing this so far seem
> > > > > > > > > >    to favor total order.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Jan
> > > > > > > > >
> > > > > > > > > > On Sat, Jul 26, 2025 at 17:38, Gang Wu <ust...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > > As this discussion has been open for more than two years,
> > > > > > > > > > > I’d like to bump up this thread again to update the
> > > > > > > > > > > progress and collect feedback.
> > > > > > > > > >
> > > > > > > > > > *Background*
> > > > > > > > > > > • Today Parquet’s min/max stats and page index omit NaNs
> > > > > > > > > > >   entirely.
> > > > > > > > > > > • Engines can’t safely prune floating-point values because
> > > > > > > > > > >   they know nothing about NaNs.
> > > > > > > > > > > • Column index is disabled if any page contains only NaNs.
> > > > > > > > > >
> > > > > > > > > > There are two active proposals as below:
> > > > > > > > > >
> > > > > > > > > > > *Proposal A - IEEE754TotalOrder* (from the PR [1])
> > > > > > > > > > > • Define a new ColumnOrder to include +0, –0 and all NaN
> > > > > > > > > > >   bit‐patterns.
> > > > > > > > > > > • Stats and column index store NaNs if they appear.
> > > > > > > > > > > • Three PoC impls are ready: arrow-rs [2], duckdb [3] and
> > > > > > > > > > >   parquet-java [4].
> > > > > > > > > > > • For more context of this approach, please refer to
> > > > > > > > > > >   discussion in [5].
> > > > > > > > > >
> > > > > > > > > > > *Proposal B - add nan_count* (from a comment [6] to [1])
> > > > > > > > > > > • Add `nan_count` to stats and a `nan_counts` list to
> > > > > > > > > > >   column index.
> > > > > > > > > > > • For all‐NaNs cases, write NaN to min/max and use
> > > > > > > > > > >   nan_count to distinguish.
> > > > > > > > > >
> > > > > > > > > > > Both solutions have pros and cons but are way better than
> > > > > > > > > > > the status quo today.
> > > > > > > > > > > Please share your thoughts on the two proposals above, or
> > > > > > > > > > > maybe come up with better alternatives. We need consensus
> > > > > > > > > > > on one proposal and move forward.
> > > > > > > > > >
> > > > > > > > > > [1] https://github.com/apache/parquet-format/pull/221
> > > > > > > > > > [2] https://github.com/apache/arrow-rs/pull/7408
> > > > > > > > > > [3]
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
> > > > > > > > > > [4] https://github.com/apache/parquet-java/pull/3191
> > > > > > > > > > [5] https://github.com/apache/parquet-format/pull/196
> > > > > > > > > > [6]
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Gang
> > > > > > > > > >
> > > > > > > > > > > On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <jpfi...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Dear contributors,
> > > > > > > > > > >
> > > > > > > > > > > > My PR has now gathered comments for a week and the gist of
> > > > > > > > > > > > all open issues is the question of how to encode
> > > > > > > > > > > > pages/column chunks that contain only NaNs. There are
> > > > > > > > > > > > different suggestions and I don't see one common favorite
> > > > > > > > > > > > yet.
> > > > > > > > > > >
> > > > > > > > > > > > I have outlined three alternatives of how we can handle
> > > > > > > > > > > > these and I want us to reach a conclusion here, so I can
> > > > > > > > > > > > update my PR accordingly and move on with it. As this is my
> > > > > > > > > > > > first contribution to parquet, I don't know the decision
> > > > > > > > > > > > processes here. Do we vote? Is there a single or group of
> > > > > > > > > > > > decision makers? *Please let me know how to come to a
> > > > > > > > > > > > conclusion here; what are the next steps?*
> > > > > > > > > > >
> > > > > > > > > > > > For reference, here are the three alternatives I pointed
> > > > > > > > > > > > out. You can find a detailed description of their PROs and
> > > > > > > > > > > > CONs in my comment:
> > > > > > > > > > > > https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
> > > > > > > > > > >
> > > > > > > > > > > > 1. My initial proposal, i.e., encoding only-NaN pages by
> > > > > > > > > > > > min=max=NaN.
> > > > > > > > > > > > 2. Adding `num_values` to the ColumnIndex, to make it
> > > > > > > > > > > > symmetric with Statistics in pages & `ColumnMetaData` and
> > > > > > > > > > > > to enable the computation
> > > > > > > > > > > > `num_values - null_count - nan_count == 0`
> > > > > > > > > > > > 3. Adding a `nan_pages` bool list to the column index,
> > > > > > > > > > > > which indicates whether a page contains only NaNs
> > > > > > > > > > >
> > > > > > > > > > > Cheers
> > > > > > > > > > > Jan Finis
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS](PARQUET-2249) Add nan_count to handle NaNs in statistics

Reply via email to