Thanks Zehua! Really appreciate it!

On Mon, Mar 16, 2026 at 10:40 AM Zehua Zou <[email protected]> wrote:
> Hello Gang and others,
>
> I am willing to implement the C++ POC.
>
> On Mar 14, 2026, at 23:56, Gang Wu <[email protected]> wrote:
> >
> > Update:
> >
> > Java POC is ready for IEEE 754 column order combined with nan_count:
> > https://github.com/apache/parquet-java/pull/3393
> >
> > The spec PR has been updated earlier to address all comments:
> > https://github.com/apache/parquet-format/pull/514
> >
> > Really appreciate any review and feedback!
> >
> > Best,
> > Gang
> >
> > On Wed, Feb 11, 2026 at 4:24 PM Gang Wu <[email protected]> wrote:
> >
> >> Hello all,
> >>
> >> I'm reaching out to help drive this long-running discussion (nearly three years now) towards a final resolution. With Jan's authorization, and my sincere thanks for his sustained effort, I want to help push this issue to the finish line.
> >>
> >> To recap, we have two primary proposals on how to handle NaNs in statistics and column indexes:
> >>
> >> * IEEE 754 Total Order [1]: Proposes adding a new column order IEEE754TotalOrder for FLOAT/DOUBLE/FLOAT16. This provides a defined ordering for every float bit pattern, including NaNs and -0/+0, allowing writers to include NaNs in min/max and removing ambiguity for only-NaN pages.
> >> * Combined Approach [2]: Proposes adopting the IEEE 754 total order alongside explicit nan_count(s) fields. This approach mandates the nan_count(s) when the new order is used and clarifies how to handle edge cases from legacy writers.
> >>
> >> Based on the recent comments, it appears the combined approach [2] is gaining consensus, although the IEEE 754 total order [1] still has strong advocates.
> >>
> >> I agree with the sentiment that technical direction should be made by consensus, not a vote. To that end, I'd like to solicit further feedback specifically on the combined approach [2] to see if we can achieve the necessary consensus to move forward now.
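
For readers following the recap above, here is a minimal sketch of how the IEEE 754 total order ranks double values, including -0/+0 and NaNs with either sign bit. It is not taken from either PoC; the class and method names (`TotalOrderSketch`, `totalOrderKey`, `compareTotalOrder`) are hypothetical, and it relies on the common bit-manipulation trick of flipping the non-sign bits of negative values so that a signed integer comparison yields the total order.

```java
import java.util.Arrays;

final class TotalOrderSketch {
    // Maps a double to a long key whose signed ordering matches IEEE 754 totalOrder.
    static long totalOrderKey(double value) {
        long bits = Double.doubleToRawLongBits(value);
        // For values with the sign bit set, flip all non-sign bits so that
        // larger magnitudes sort lower; non-negative values stay unchanged.
        return bits ^ ((bits >> 63) >>> 1);
    }

    // Hypothetical comparator: negative, zero, or positive like Comparator.compare.
    static int compareTotalOrder(double a, double b) {
        return Long.compare(totalOrderKey(a), totalOrderKey(b));
    }

    public static void main(String[] args) {
        // A NaN with the sign bit set, built from its raw bit pattern.
        double negativeNan = Double.longBitsToDouble(0xFFF8000000000000L);
        Double[] values = {Double.NaN, 1.0, 0.0, -0.0, Double.NEGATIVE_INFINITY, negativeNan};
        Arrays.sort(values, TotalOrderSketch::compareTotalOrder);
        // NaNs print identically, so show the raw bits to see where each one landed.
        for (Double v : values) {
            System.out.printf("%s (0x%016X)%n", v, Double.doubleToRawLongBits(v));
        }
    }
}
```

Under this ordering, a NaN with the sign bit set sorts before -Infinity and a NaN with the sign bit clear sorts after +Infinity, which is why min/max can hold NaNs without ambiguity once the order is fixed.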
> >>
> >> I recall that the total order proposal [1] already has three PoC implementations. For the combined approach [2], I can draft a PoC in parquet-java, but to meet the two-implementation requirement, we would need one more contributor to step up.
> >>
> >> [1] https://github.com/apache/parquet-format/pull/221
> >> [2] https://github.com/apache/parquet-format/pull/514
> >>
> >> Best,
> >> Gang
> >>
> >> On Sat, Aug 16, 2025 at 1:59 AM Gijs Burghoorn <[email protected]> wrote:
> >>>
> >>> Hello Jan,
> >>>
> >>> Thank you for pushing this through. Apart from some smaller nits, we also really like the current proposal.
> >>>
> >>> Thanks,
> >>> Gijs
> >>>
> >>> On Fri, Aug 15, 2025 at 3:33 PM Andrew Lamb <[email protected]> wrote:
> >>>
> >>>> I have started organizing a project [1] in arrow-rs's Parquet reader to try and implement this proposal.
> >>>>
> >>>> Hopefully that can be one of the two open source implementations needed.
> >>>>
> >>>> Thanks again for helping drive this along,
> >>>> Andrew
> >>>>
> >>>> [1] https://github.com/apache/arrow-rs/issues/8156
> >>>>
> >>>> On Wed, Aug 13, 2025 at 5:39 AM Jan Finis <[email protected]> wrote:
> >>>>
> >>>>> I have now tagged <https://github.com/apache/parquet-format/pull/514#issuecomment-3182978173> the people that argued for total order in the initial PR. Let's see their response.
> >>>>>
> >>>>> If I understand the adoption process correctly, the next hurdle to getting this adopted is two open source (!) implementations proving its feasibility. We already had that for IEEE total order. If we prefer the solution with nan counts, we'll need it there as well. I myself work on a proprietary implementation, so I'm counting on others here :).
> >>>>> Be prepared though, this will likely take months unless the interest in this topic has risen to a point where people are eager to jump on the implementation right away.
> >>>>>
> >>>>> So, I guess it will take some months of soaking time before any formal vote can be done (given that we reach consensus that this is what we want and we find people for the implementations).
> >>>>>
> >>>>> Cheers,
> >>>>> Jan
> >>>>>
> >>>>> On Wed, Aug 13, 2025 at 01:18, Ryan Blue <[email protected]> wrote:
> >>>>>
> >>>>>> Thanks, Jan. I also went through the combined proposal and it looks mostly good to me.
> >>>>>>
> >>>>>>> First of all, to make it quick: Yes, the solution of having nan_counts *and* total order, which was brought up multiple times, does work and solves more cases than either one alone.
> >>>>>>
> >>>>>> Great, then we have a solution for both filtering use cases and for moving ahead with total order. And thanks to Andrew for suggesting this as well on the second PR. It also looks like this is something that Orson is okay with, given his comments on the latest PR.
> >>>>>>
> >>>>>> Is there anyone against the combined approach? I don't see a big downside for anyone. It is compatible with previous stats rules, has a NaN count, and allows using either type-specific order or total order.
> >>>>>>
> >>>>>> Assuming that this satisfies the big objections, I think we should wait for a few days to make sure everyone has time to check out the new PR and then vote to adopt it.
> >>>>>>
> >>>>>> Ryan
> >>>>>>
> >>>>>> On Mon, Aug 11, 2025 at 6:03 AM Andrew Lamb <[email protected]> wrote:
> >>>>>>
> >>>>>>> Thank you Jan -- I read through the new combined proposal, and I thought it looks good and addresses the feedback so far. I left some small style suggestions, but nothing that is required from my perspective.
> >>>>>>>
> >>>>>>> On Sat, Aug 9, 2025 at 9:07 AM Jan Finis <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hey Ryan,
> >>>>>>>>
> >>>>>>>> Thanks for chiming in. First of all, to make it quick: Yes, the solution of having nan_counts *and* total order, which was brought up multiple times, does work and solves more cases than either one alone.
> >>>>>>>>
> >>>>>>>>> I strongly prefer continuing to discuss the merits of these approaches rather than trying to decide with a vote.
> >>>>>>>>
> >>>>>>>> In theory, I agree that it isn't good to silence a discussion by just voting for one possible solution and that technical issues should be discussed. However, please note that we have been circling on this for over two years now, including an extended discussion that brought up all arguments multiple times. This is in stark contrast to the speed with which you guys work on the Iceberg spec, for example. There, you also do not discuss the merits of various solutions for multiple years. You just pick one and merge it after a *reasonable* time of discussion. If you had the speed we currently have here, nothing would get done. Thus, I see this as a clear case of *"the perfect is the enemy of the good"*.
> >>>>>>>> Yes, we can continue looking for the perfect solution, but that will likely lead to keeping us at the status quo, which is the worst of them all.
> >>>>>>>>
> >>>>>>>> That being said, I'm also happy to create a PR which does both total order and NaN counts; after all, I just want the issue solved and all these solutions are better than the status quo.
> >>>>>>>>
> >>>>>>>> *As this was now suggested by at least three people, I guess it's worth doing, so here you go: https://github.com/apache/parquet-format/pull/514*
> >>>>>>>>
> >>>>>>>> With this, we should have PRs covering most of the solution space. (I'm refusing to create a PR with negative and positive nan_counts; nan_counts + total order has to suffice; the complexity madness has to stop somewhere.) I still believe that there were a number of people who already found nan_counts too complex and therefore wanted IEEE total order, and these people may not like putting on extra complexity, but let's see, maybe some have also changed their opinion in the meantime.
> >>>>>>>>
> >>>>>>>> *Given all this, we can also first do an informal vote where everyone can vote for which of the three their favorite would be. Maybe a clear favorite will emerge and then we can vote on this one.*
> >>>>>>>>
> >>>>>>>> But of course, we can also take some weeks to discuss the three solutions, now that we have PRs for all of them. I just hope this won't make us continue for another 2 years, or end in an infinite stalemate where each solution is vetoed by a PMC member.
> >>>>>>>> (Sorry for becoming a bit cynical here; I have just spent way too much time of my life with double statistics at this point ;) ...)
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Jan
> >>>>>>>>
> >>>>>>>> On Fri, Aug 8, 2025 at 23:38, Ryan Blue <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Regarding the process for this, I strongly prefer continuing to discuss the merits of these approaches rather than trying to decide with a vote. I don't think it is a good practice to use a vote to decide on a technical direction. There are very few situations that warrant it and I don't think that this is one of them. While this issue has been open for a long time, that appears to be the result of it not being anyone's top priority rather than indecision.
> >>>>>>>>>
> >>>>>>>>> For the technical merits of these approaches, I think that we can find a middle ground. I agree with Jan that when working with sorted values, we need to know how NaN values were handled and that requires using a well-defined order that includes NaN and its variations (because we should not normalize). Using NaN count is not sufficient for ordering rows.
> >>>>>>>>>
> >>>>>>>>> Gijs also brings up good points about how NaN values show up in actual datasets: not just when used in place of null, but also as the result of normal calculations on abnormal data, like `sqrt(-4.0)` or `log(-1.0)`.
> >>>>>>>>> Both of those present problems when mixed with valid data because of the stats "poisoning" problem, where the range of valid data is usable until a single NaN is mixed in.
> >>>>>>>>>
> >>>>>>>>> Another issue is that NaN is error-prone because "regular" comparison is always false:
> >>>>>>>>> ```
> >>>>>>>>> Math.log(-1.0) >= 2 => FALSE
> >>>>>>>>> Math.log(-1.0) < 2 => FALSE
> >>>>>>>>> 2 > Math.log(-1.0) => FALSE
> >>>>>>>>> ```
> >>>>>>>>>
> >>>>>>>>> As a result, Iceberg doesn't trust NaN values as either lower or upper bounds because we don't want to go back to the code that produced the value to see what the comparison order was to determine whether NaN values go before or after others.
> >>>>>>>>>
> >>>>>>>>> Total order solves the second issue in theory, but regular comparison is prevalent and not obvious to developers. And it also doesn't help when NaN is used instead of null. So using total order is not sufficient for data skipping.
> >>>>>>>>>
> >>>>>>>>> I think the right compromise is to use `min`, `max`, and `nan_count` for data skipping stats (where min and max cannot be NaN) and total ordering for sorting values. That satisfies the data skipping use cases and also gives us an ordering of unaltered values that we can reason about.
> >>>>>>>>>
> >>>>>>>>> Does anyone think that doesn't work?
> >>>>>>>>>
> >>>>>>>>> Ryan
> >>>>>>>>>
> >>>>>>>>> On Fri, Aug 1, 2025 at 8:57 AM Gang Wu <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Thanks Jan for your endless effort on this!
> >>>>>>>>>>
> >>>>>>>>>> I'm in favor of simplicity and generalism.
> >>>>>>>>>> I think we have already debated `nan_count` a lot in [1], and [2] is the reflection of those discussions. Therefore I am inclined to start a vote for [2] unless there is a significantly better proposal.
> >>>>>>>>>>
> >>>>>>>>>> I would suggest that everyone interested in this discussion attend the scheduled sync on Aug 6th (detailed below) to spread the word to the broader community. If we can get a consensus on [2], I can help start the vote and move forward.
> >>>>>>>>>>
> >>>>>>>>>> *Apache Parquet Community Sync*
> >>>>>>>>>> *Wednesday, August 6 · 10:00 – 11:00am*
> >>>>>>>>>> *Time zone: America/Los_Angeles*
> >>>>>>>>>> *Google Meet joining info*
> >>>>>>>>>> *Video call link: https://meet.google.com/bhe-rvan-qjk*
> >>>>>>>>>>
> >>>>>>>>>> [1] https://github.com/apache/parquet-format/pull/196
> >>>>>>>>>> [2] https://github.com/apache/parquet-format/pull/221
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Gang
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Aug 1, 2025 at 6:16 PM Jan Finis <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Gijs,
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you for bringing up concrete points, I'm happy to discuss them in detail.
> >>>>>>>>>>>
> >>>>>>>>>>>> NaNs are less common in the SQL world than in the DataFrame world where NaNs were used for a long time to represent missing values.
> >>>>>>>>>>>
> >>>>>>>>>>> You could transcode between NULL and NaN before reading and writing to Parquet.
> >>>>>>>>>>> You basically mention yourself that NaNs were used for missing values, i.e., what is commonly a NULL, which wasn't available. So, semantically, transcoding to NULL would even be the sane thing to do. Yes, that will cost you some cycles, but should be a rather lightweight operation in comparison to most other operations, so I would argue that it won't totally ruin your performance. Similarly, why should Parquet play along with a "hack" that was done in other frameworks due to shortcomings of those frameworks? So from a philosophical point of view, I think supporting NaNs better is the wrong thing to do. Rather, we should be a forcing function to align others to better behavior, so applying a bit of force might in the long run make people use NULLs also in DataFrames.
> >>>>>>>>>>>
> >>>>>>>>>>> Of course, your argument also goes into the direction of pragmatism: If a large part of the data science world uses NaNs to encode missing values, then maybe Parquet should accept this de-facto standard rather than fighting it. That is indeed a valid point. The weight of it is debatable and my personal conclusion is that it's still not worth it, as you can transcode between NULLs and NaNs, but I do agree with its validity.
> >>>>>>>>>>>
> >>>>>>>>>>>> Since the proposal phrases it as a goal to work "regardless of how they order NaN w.r.t.
> >>>>>>>>>>>> other values" this statement feels out-of-place to me. Most hardware and most people don't care about total ordering and needing to take it into account while filtering using statistics seems like preferring the special case instead of the common case. Almost no one filters for specific NaN value bit-patterns. SQL engines that don't have IEEE total ordering as their default ordering for floats will also need to do more special handling for this.
> >>>>>>>>>>>
> >>>>>>>>>>> I disagree with the conclusion this statement draws. The current behavior, and nan_counts without total ordering, pose a real problem here, even for engines that don't care about bit patterns. I do agree that most database engines, including the one I'm working on, do not care about bit patterns and/or sign bits. However, how can our database engine know whether the writer of a Parquet file saw it the same way? It can't. Therefore, it cannot know whether a writer, for example, ordered NaNs before or after all other numbers, or maybe ordered them by sign bit.
> >>>>>>>>>>> So, if our database engine now sees a float column in sorting columns, it cannot apply any optimization without a lot of special casing, as it doesn't know whether NaNs will be before all other values, after all other values, or maybe both, depending on sign bit. It could apply contrived logic that tries to infer where NaNs were placed from the NaN counts of the first and last page, but doing so will be a lot of ugly code that also feels to be in the wrong place. I.e., I don't want to need to load pages or the page index, just to reason about a sort order.
> >>>>>>>>>>>
> >>>>>>>>>>>> SQL engines that don't have IEEE total ordering as their default ordering for floats will also need to do more special handling for this.
> >>>>>>>>>>>
> >>>>>>>>>>> This code, which I would indeed need to write for our engine, is comparably trivial. Simply choose the largest possible bit pattern as comparison for upper bounds filtering for NaN, and the smallest possible bit pattern for lower bounds. It's not more than a few lines of code that check whether a filter is NaN and then replace its value with the highest/lowest NaN bit pattern.
> >>>>>>>>>>> It is similarly trivial to the special casing I need to do with nan_counts, and it is way more trivial than the extra code I would need to write for sorting columns, as described above.
> >>>>>>>>>>>
> >>>>>>>>>>>> From a Polars perspective, having a `nan_count` and defining what happens to the `min` and `max` statistics when a page contains only NaNs is enough to allow for all predicate filtering. I think, but correct me if I am wrong, this is also enough for all SQL engines that don't use total ordering.
> >>>>>>>>>>>
> >>>>>>>>>>> It's not fully enough, as described above. Sorting columns would still not work properly.
> >>>>>>>>>>>
> >>>>>>>>>>>> As for ways forward, I propose merging the `nan_count` and `sort ordering` proposals into one to make one proposal
> >>>>>>>>>>>
> >>>>>>>>>>> Note that the initial reason for proposing IEEE total order was that people in the discussion threads found nan_counts to be too complex and too much of an undeserving special case (re-read the discussion in the initial PR <https://github.com/apache/parquet-format/pull/196> to see the rationales). So merging both together would go totally against the spirit of why IEEE total order was proposed. While it has further upsides, the main reason was indeed to *not have* nan_counts.
> >>>>>>>>>>> If the proposal would now even go to positive and negative nan counts (i.e., even more complexity), this would go 180 degrees into the opposite direction of why people wanted total order in the first place.
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>> Jan
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Jul 31, 2025 at 23:23, Gijs Burghoorn <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hello Jan and others,
> >>>>>>>>>>>>
> >>>>>>>>>>>> First, let me preface by saying I am quite new here. So I apologize if there is some other better way to bring up these concerns. I understand it is very annoying to come in at the 11th hour and start bringing up a bunch of concerns, but I would also like this to be done right. A colleague of mine brought up some concerns and alternative approaches in the GitHub thread; I will file some of the concerns here as a response.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Treating NaNs so specially is giving them attention they don't deserve. Most data sets do not contain NaNs. If a use case really requires them and needs filtering to ignore them, they can store NULL instead, or encode them differently. I would prefer the average case over the special case here.
> >>>>>>>>>>>>
> >>>>>>>>>>>> NaNs are less common in the SQL world than in the DataFrame world where NaNs were used for a long time to represent missing values.
> >>>>>>>>>>>> They still exist with different canonical representations and different sign bits. I agree it might not be correct semantically, but sadly that is the world we deal with. NumPy and Numba do not have missing data functionality; people use NaNs there, and people definitely use that in their analytical dataflows. Another point that was brought up in the GH discussion was "what about infinity? You could argue that having infinity in statistics is similarly unuseful as it's too wide of a bound". I would argue that infinity is very different as there is no discussion on what the ordering or pattern of infinity is. Everyone agrees that `min(1.0, inf, -inf) == -inf` and each infinity only has a single bit pattern.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> It gives a defined order to every bit pattern and thus yields a total order, mathematically speaking, which has value by itself. With NaN counts, it was still undefined how different bit patterns of NaNs were supposed to be ordered, whether NaN was allowed to have a sign bit, etc., risking that different engines could come to different results while filtering or sorting values within a file.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Since the proposal phrases it as a goal to work "regardless of how they order NaN w.r.t.
> >>>>>>>>>>>> other values" this statement feels out-of-place to me. Most hardware and most people don't care about total ordering and needing to take it into account while filtering using statistics seems like preferring the special case instead of the common case. Almost no one filters for specific NaN value bit-patterns. SQL engines that don't have IEEE total ordering as their default ordering for floats will also need to do more special handling for this.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I also agree with my colleague that taking an approach that is 50% of the way there will make the barrier to improving it to what it actually should be later on much higher.
> >>>>>>>>>>>>
> >>>>>>>>>>>> As for ways forward, I propose merging the `nan_count` and `sort ordering` proposals into one to make one proposal, as they are linked together, and moving forward with one without knowing what will happen to the other seems unwise. From a Polars perspective, having a `nan_count` and defining what happens to the `min` and `max` statistics when a page contains only NaNs is enough to allow for all predicate filtering. I think, but correct me if I am wrong, this is also enough for all SQL engines that don't use total ordering.
> >>>>>>>>>>>> But if you want to be impartial to the engine's floating-point ordering and allow engines with total ordering to do inequality filters when `nan_count > 0` you would need a `positive_nan_count` and a `negative_nan_count`. I understand the downside with Thrift complexity, but introducing another sort order is also adding complexity just in a different place.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I would really like to see this move forward, so I hope these concerns help move it forward towards a solution that works for everyone.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>> Gijs
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Jul 31, 2025 at 7:46 PM Andrew Lamb <[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I would also be in favor of starting a vote
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Jul 31, 2025 at 11:23 AM Jan Finis <[email protected]> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> As the author of both the IEEE754 total order <https://github.com/apache/parquet-format/pull/221> PR and the earlier PR that basically proposed `nan_count` <https://github.com/apache/parquet-format/pull/196>, my current vote would be for IEEE754 total order.
> >>>>>>>>>>>>>> Consequently, I would like to request a formal vote for the PR introducing IEEE754 total order (https://github.com/apache/parquet-format/pull/221), if that is possible.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> My Rationales:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> - It's conceptually simpler. It's easier to explain. It's based on an IEEE-standardized order predicate.
> >>>>>>>>>>>>>> - There are already multiple implementations showing feasibility. This will likely make the adoption quicker.
> >>>>>>>>>>>>>> - It gives a defined order to every bit pattern and thus yields a total order, mathematically speaking, which has value by itself. With NaN counts, it was still undefined how different bit patterns of NaNs were supposed to be ordered, whether NaN was allowed to have a sign bit, etc., risking that different engines could come to different results while filtering or sorting values within a file.
> >>>>>>>>>>>>>> - It also solves sort order completely. With nan_counts only, it is still undefined whether nans should be sorted before or after all values (or both, depending on sign bit), so any file including NaNs could not really leverage sort order without being ambiguous.
> >>>>>>>>>>>>>> - It's less complex in thrift. Having fields that only apply to a handful of data types is somehow weird.
> >>>>>>>>>>>>>>   If every type did this, we would
> >>>>>>>>>>>>>>   have a plethora of non-generic fields in thrift.
> >>>>>>>>>>>>>> - Treating NaNs so specially is giving them attention they don't
> >>>>>>>>>>>>>>   deserve. Most data sets do not contain NaNs. If a use case really requires
> >>>>>>>>>>>>>>   them and needs filtering to ignore them, they can store NULL instead,
> >>>>>>>>>>>>>>   or encode them differently. I would prefer the average case over the
> >>>>>>>>>>>>>>   special case here.
> >>>>>>>>>>>>>> - The majority of the people discussing this so far seem to favor total
> >>>>>>>>>>>>>>   order.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>> Jan
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Sat, Jul 26, 2025 at 5:38 PM Gang Wu <[email protected]> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> As this discussion has been open for more than two years, I’d like to bump up
> >>>>>>>>>>>>>>> this thread again to update the progress and collect feedback.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *Background*
> >>>>>>>>>>>>>>> • Today Parquet’s min/max stats and page index omit NaNs entirely.
> >>>>>>>>>>>>>>> • Engines can’t safely prune floating-point values because they know
> >>>>>>>>>>>>>>>   nothing about NaNs.
> >>>>>>>>>>>>>>> • Column index is disabled if any page contains only NaNs.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> There are two active proposals as below:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *Proposal A - IEEE754TotalOrder* (from the PR [1])
> >>>>>>>>>>>>>>> • Define a new ColumnOrder to include +0, –0 and all NaN bit‐patterns.
> >>>>>>>>>>>>>>> • Stats and column index store NaNs if they appear.
> >>>>>>>>>>>>>>> • Three PoC impls are ready: arrow-rs [2], duckdb [3] and parquet-java [4].
> >>>>>>>>>>>>>>> • For more context of this approach, please refer to discussion in [5].
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *Proposal B - add nan_count* (from a comment [6] to [1])
> >>>>>>>>>>>>>>> • Add `nan_count` to stats and a `nan_counts` list to column index.
> >>>>>>>>>>>>>>> • For all‐NaNs cases, write NaN to min/max and use nan_count to
> >>>>>>>>>>>>>>>   distinguish.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Both solutions have pros and cons but are way better than the status quo
> >>>>>>>>>>>>>>> today.
> >>>>>>>>>>>>>>> Please share your thoughts on the two proposals above, or maybe come up
> >>>>>>>>>>>>>>> with better alternatives. We need to reach consensus on one proposal and
> >>>>>>>>>>>>>>> move forward.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> [1] https://github.com/apache/parquet-format/pull/221
> >>>>>>>>>>>>>>> [2] https://github.com/apache/arrow-rs/pull/7408
> >>>>>>>>>>>>>>> [3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
> >>>>>>>>>>>>>>> [4] https://github.com/apache/parquet-java/pull/3191
> >>>>>>>>>>>>>>> [5] https://github.com/apache/parquet-format/pull/196
> >>>>>>>>>>>>>>> [6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>> Gang
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <[email protected]> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Dear contributors,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> My PR has now gathered comments for a week and the gist of all open issues
> >>>>>>>>>>>>>>>> is the question of how to encode pages/column chunks that contain only
> >>>>>>>>>>>>>>>> NaNs. There are different suggestions and I don't see one common favorite
> >>>>>>>>>>>>>>>> yet.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I have outlined three alternatives of how we can handle these and I want us
> >>>>>>>>>>>>>>>> to reach a conclusion here, so I can update my PR accordingly and move on
> >>>>>>>>>>>>>>>> with it.
> >>>>>>>>>>>>>>>> As this is my first contribution to parquet, I don't know the
> >>>>>>>>>>>>>>>> decision processes here. Do we vote? Is there a single decision maker or a
> >>>>>>>>>>>>>>>> group of decision makers? *Please let me know how to come to a conclusion
> >>>>>>>>>>>>>>>> here; what are the next steps?*
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> For reference, here are the three alternatives I pointed out. You can find
> >>>>>>>>>>>>>>>> a detailed description of their PROs and CONs in my comment:
> >>>>>>>>>>>>>>>> https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 1. My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
> >>>>>>>>>>>>>>>> 2. Adding `num_values` to the ColumnIndex, to make it symmetric with
> >>>>>>>>>>>>>>>>    Statistics in pages & `ColumnMetaData` and to enable the computation
> >>>>>>>>>>>>>>>>    `num_values - null_count - nan_count == 0`
> >>>>>>>>>>>>>>>> 3. Adding a `nan_pages` bool list to the column index, which indicates
> >>>>>>>>>>>>>>>>    whether a page contains only NaNs
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Cheers
> >>>>>>>>>>>>>>>> Jan Finis
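
For readers following the thread, the two mechanisms it debates can be sketched in a few lines of Python. This is only an illustration under stated assumptions and is not taken from any of the linked PRs. The helper names `total_order_key`, `only_nan_or_null` and `float_from_bits` are hypothetical. The bit trick is one common way to realize the IEEE 754 totalOrder predicate for doubles, and the second helper simply evaluates the `num_values - null_count - nan_count == 0` check from alternative 2, assuming `num_values` includes nulls.

```python
import struct

_MASK_63 = 0x7FFFFFFFFFFFFFFF  # every bit except the sign bit


def total_order_key(x: float) -> int:
    """Hypothetical helper: int key whose order matches IEEE 754 totalOrder."""
    # Reinterpret the 64 bits of the double as a signed integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative floats must sort in reverse magnitude order, so flip all
    # bits except the sign bit. Non-negative floats already sort correctly.
    return bits ^ _MASK_63 if bits < 0 else bits


def only_nan_or_null(num_values: int, null_count: int, nan_count: int) -> bool:
    """Hypothetical helper: the check quoted in alternative 2 of the thread.

    Assumes num_values counts nulls too; an all-null page also satisfies it.
    """
    return num_values - null_count - nan_count == 0


def float_from_bits(bits: int) -> float:
    """Build a double from a raw 64-bit pattern (used to get signed NaNs)."""
    (x,) = struct.unpack("<d", struct.pack("<Q", bits))
    return x


neg_nan = float_from_bits(0xFFF8000000000000)  # quiet NaN, sign bit set
pos_nan = float_from_bits(0x7FF8000000000000)  # quiet NaN, sign bit clear

values = [1.5, pos_nan, -0.0, float("-inf"), 0.0, neg_nan, float("inf"), -2.0]
ordered = sorted(values, key=total_order_key)
```

Under this key, negative NaNs sort below -inf, -0 sorts below +0, and positive NaNs sort above +inf, which is why min/max can carry NaNs without ambiguity under a total order. A page with 10 values, 4 nulls and 6 NaNs satisfies `only_nan_or_null(10, 4, 6)`.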
