Hello Jan and others,

First, let me preface this by saying I am quite new here, so I apologize if there is a better way to raise these concerns. I understand it is very annoying to come in at the 11th hour and start bringing up a bunch of concerns, but I would also like this to be done right. A colleague of mine brought up some concerns and alternative approaches in the GitHub thread; I will file some of the concerns here as a response.
> Treating NaNs so specially is giving them attention they don't deserve.
> Most data sets do not contain NaNs. If a use case really requires them and
> needs filtering to ignore them, they can store NULL instead, or encode them
> differently. I would prefer the average case over the special case here.

NaNs are less common in the SQL world than in the DataFrame world, where NaNs were used for a long time to represent missing values. They still exist with different canonical representations and different sign bits. I agree it might not be correct semantically, but sadly that is the world we deal with. NumPy and Numba do not have missing data functionality, so people use NaNs there, and people definitely use that in their analytical dataflows.

Another point that was brought up in the GH discussion was "what about infinity? You could argue that having infinity in statistics is similarly unuseful as it's too wide of a bound". I would argue that infinity is very different, as there is no disagreement about how infinity is ordered or what its bit pattern is. Everyone agrees that `min(1.0, inf, -inf) == -inf`, and each infinity has only a single bit pattern.

> It gives a defined order to every bit pattern and thus yields a total
> order, mathematically speaking, which has value by itself. With NaN counts,
> it was still undefined how different bit patterns of NaNs were supposed to
> be ordered, whether NaN was allowed to have a sign bit, etc., risking that
> different engines could come to different results while filtering or
> sorting values within a file.

Since the proposal phrases it as a goal to work "regardless of how they order NaN w.r.t. other values", this statement feels out of place to me. Most hardware and most people don't care about total ordering, and needing to take it into account while filtering using statistics seems like preferring the special case over the common case. Almost no one filters for specific NaN bit patterns.
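To make the infinity-versus-NaN distinction concrete, here is a minimal Python sketch. It is an illustration only, not code from either PR: the helper names `total_order_key` and `from_bits` are hypothetical, and the key function is just one common way to realize an IEEE 754 total-order comparison on doubles.

```python
import math
import struct

def from_bits(bits: int) -> float:
    """Build a double from a raw 64-bit pattern (hypothetical helper)."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def total_order_key(x: float) -> int:
    """Map a double to an int whose natural order follows IEEE 754 total order."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    if bits >> 63:
        # Sign bit set: flip every bit so larger magnitudes sort lower.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set it so these values sort above all negatives.
    return bits | 0x8000000000000000

inf = float("inf")
nan = float("nan")

# Infinity orders unambiguously under ordinary comparison.
inf_min = min(1.0, inf, -inf)

# NaN does not: every comparison with NaN is False, so the result of
# min() depends on argument order.
nan_first = min(nan, 1.0)
nan_last = min(1.0, nan)

# Two quiet NaNs that differ only in the sign bit.
pos_nan = from_bits(0x7FF8000000000000)
neg_nan = from_bits(0xFFF8000000000000)

values = [1.0, inf, -inf, pos_nan, neg_nan, 0.0, -0.0]
ordered = sorted(values, key=total_order_key)
signs = [math.copysign(1.0, v) for v in ordered]
```

Under this key, the negative NaN sorts below `-inf` and the positive NaN above `+inf`, which is why a filter that is indifferent to NaN bit patterns still has to reason about them once statistics follow total order.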
SQL engines that don't have IEEE total ordering as their default ordering for floats will also need to do more special handling for this.

I also agree with my colleague that adopting an approach that only gets 50% of the way there will make the barrier to later improving it to what it actually should be much higher.

As for ways forward, I propose merging the `nan_count` and `sort ordering` proposals into a single proposal, as they are linked together, and moving forward with one without knowing what will happen to the other seems unwise.

From a Polars perspective, having a `nan_count` and defining what happens to the `min` and `max` statistics when a page contains only NaNs is enough to allow for all predicate filtering. I think, but correct me if I am wrong, this is also enough for all SQL engines that don't use total ordering. But if you want to be impartial to the engine's floating-point ordering and allow engines with total ordering to do inequality filters when `nan_count > 0`, you would need a `positive_nan_count` and a `negative_nan_count`. I understand the downside with Thrift complexity, but introducing another sort order also adds complexity, just in a different place.

I would really like to see this move forward, so I hope these concerns help move it towards a solution that works for everyone.

Kind regards,
Gijs

On Thu, Jul 31, 2025 at 7:46 PM Andrew Lamb <andrewlam...@gmail.com> wrote:

> I would also be in favor of starting a vote
>
> On Thu, Jul 31, 2025 at 11:23 AM Jan Finis <jpfi...@gmail.com> wrote:
>
> > As the author of both the IEEE754 total order
> > <https://github.com/apache/parquet-format/pull/221> PR and the earlier PR
> > that basically proposed `nan_count`
> > <https://github.com/apache/parquet-format/pull/196>, my current vote would
> > be for IEEE754 total order.
> > Consequently, I would like to request a formal vote for the PR introducing
> > IEEE754 total order (https://github.com/apache/parquet-format/pull/221),
> > if that is possible.
> >
> > My Rationales:
> >
> >    - It's conceptually simpler. It's easier to explain. It's based on an
> >    IEEE-standardized order predicate.
> >    - There are already multiple implementations showing feasibility. This
> >    will likely make the adoption quicker.
> >    - It gives a defined order to every bit pattern and thus yields a total
> >    order, mathematically speaking, which has value by itself. With NaN
> >    counts, it was still undefined how different bit patterns of NaNs were
> >    supposed to be ordered, whether NaN was allowed to have a sign bit,
> >    etc., risking that different engines could come to different results
> >    while filtering or sorting values within a file.
> >    - It also solves sort order completely. With nan_counts only, it is
> >    still undefined whether nans should be sorted before or after all
> >    values (or both, depending on sign bit), so any file including NaNs
> >    could not really leverage sort order without being ambiguous.
> >    - It's less complex in thrift. Having fields that only apply to a
> >    handful of data types is somehow weird. If every type did this, we
> >    would have a plethora of non-generic fields in thrift.
> >    - Treating NaNs so specially is giving them attention they don't
> >    deserve. Most data sets do not contain NaNs. If a use case really
> >    requires them and needs filtering to ignore them, they can store NULL
> >    instead, or encode them differently. I would prefer the average case
> >    over the special case here.
> >    - The majority of the people discussing this so far seem to favor total
> >    order.
> >
> > Cheers,
> > Jan
> >
> > On Sat, Jul 26, 2025 at 5:38 PM Gang Wu <ust...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > As this discussion has been open for more than two years, I’d like to
> > > bump up this thread again to update the progress and collect feedback.
> > >
> > > *Background*
> > > • Today Parquet’s min/max stats and page index omit NaNs entirely.
> > > • Engines can’t safely prune floating values because they know nothing
> > >   on NaNs.
> > > • Column index is disabled if any page contains only NaNs.
> > >
> > > There are two active proposals as below:
> > >
> > > *Proposal A - IEEE754TotalOrder* (from the PR [1])
> > > • Define a new ColumnOrder to include +0, –0 and all NaN bit‐patterns.
> > > • Stats and column index store NaNs if they appear.
> > > • Three PoC impls are ready: arrow-rs [2], duckdb [3] and parquet-java
> > >   [4].
> > > • For more context of this approach, please refer to discussion in [5].
> > >
> > > *Proposal B - add nan_count* (from a comment [6] to [1])
> > > • Add `nan_count` to stats and a `nan_counts` list to column index.
> > > • For all‐NaNs cases, write NaN to min/max and use nan_count to
> > >   distinguish.
> > >
> > > Both solutions have pros and cons but are way better than the status
> > > quo today.
> > > Please share your thoughts on the two proposals above, or maybe come up
> > > with better alternatives. We need consensus on one proposal and move
> > > forward.
> > >
> > > [1] https://github.com/apache/parquet-format/pull/221
> > > [2] https://github.com/apache/arrow-rs/pull/7408
> > > [3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
> > > [4] https://github.com/apache/parquet-java/pull/3191
> > > [5] https://github.com/apache/parquet-format/pull/196
> > > [6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
> > >
> > > Best,
> > > Gang
> > >
> > > On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <jpfi...@gmail.com> wrote:
> > >
> > > > Dear contributors,
> > > >
> > > > My PR has now gathered comments for a week and the gist of all open
> > > > issues is the question of how to encode pages/column chunks that
> > > > contain only NaNs. There are different suggestions and I don't see one
> > > > common favorite yet.
> > > >
> > > > I have outlined three alternatives of how we can handle these and I
> > > > want us to reach a conclusion here, so I can update my PR accordingly
> > > > and move on with it. As this is my first contribution to parquet, I
> > > > don't know the decision processes here. Do we vote? Is there a single
> > > > or group of decision makers? *Please let me know how to come to a
> > > > conclusion here; what are the next steps?*
> > > >
> > > > For reference, here are the three alternatives I pointed out. You can
> > > > find detailed description of their PROs and CONs in my comment:
> > > >
> > > > https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
> > > >
> > > > 1. My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
> > > > 2. Adding `num_values` to the ColumnIndex, to make it symmetric with
> > > >    Statistics in pages & `ColumnMetaData` and to enable the computation
> > > >    `num_values - null_count - nan_count == 0`
> > > > 3. Adding a `nan_pages` bool list to the column index, which indicates
> > > >    whether a page contains only NaNs
> > > >
> > > > Cheers
> > > > Jan Finis
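
For readers following the thread, the all-NaN check `num_values - null_count - nan_count == 0` and the `nan_count`-based pruning discussed above can be sketched roughly as follows. This is a hypothetical illustration in Python, not the Parquet API: the `PageStats` class, its field layout, and the `nan_is_largest` flag are assumptions made for the example. It also assumes that `min`/`max` exclude NaNs unless the page contains only NaNs, and that the filter threshold itself is not NaN.

```python
from dataclasses import dataclass

@dataclass
class PageStats:
    """Hypothetical per-page statistics; names are illustrative only."""
    num_values: int   # all values in the page, including nulls
    null_count: int
    nan_count: int
    min: float        # NaN only when the page holds nothing but NaNs
    max: float

def is_all_nan(s: PageStats) -> bool:
    """True if every non-null value in the page is NaN."""
    return s.nan_count > 0 and s.num_values - s.null_count - s.nan_count == 0

def may_contain_greater_than(s: PageStats, threshold: float,
                             nan_is_largest: bool) -> bool:
    """Return True if the page might hold a value matching `x > threshold`.

    `nan_is_largest` models an engine that orders NaN above every other
    value (assumption); otherwise NaN never satisfies the predicate.
    """
    if s.num_values - s.null_count == 0:
        return False              # only nulls, nothing can match
    if s.nan_count > 0 and nan_is_largest:
        return True               # a NaN in the page would match
    if is_all_nan(s):
        return False              # only NaNs, and they do not match here
    return s.max > threshold      # ordinary min/max pruning

nan = float("nan")
all_nan_page = PageStats(num_values=10, null_count=2, nan_count=8, min=nan, max=nan)
mixed_page = PageStats(num_values=10, null_count=0, nan_count=3, min=1.0, max=5.0)
```

With these statistics, `mixed_page` can be skipped for `x > 7.0` by an engine where NaN never matches, but must be read by an engine that orders NaN as the largest value. An engine using IEEE 754 total order would additionally need to know whether any of the NaNs are positive, which is where the `positive_nan_count` / `negative_nan_count` split mentioned above comes in.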