I would also be in favor of starting a vote.

On Thu, Jul 31, 2025 at 11:23 AM Jan Finis <jpfi...@gmail.com> wrote:
> As the author of both the IEEE754 total order
> <https://github.com/apache/parquet-format/pull/221> PR and the earlier PR
> that basically proposed `nan_count`
> <https://github.com/apache/parquet-format/pull/196>, my current vote would
> be for IEEE754 total order.
> Consequently, I would like to request a formal vote for the PR introducing
> IEEE754 total order (https://github.com/apache/parquet-format/pull/221), if
> that is possible.
>
> My rationale:
>
> - It's conceptually simpler. It's easier to explain. It's based on an
>   IEEE-standardized order predicate.
> - There are already multiple implementations showing feasibility. This
>   will likely make adoption quicker.
> - It gives a defined order to every bit pattern and thus yields a total
>   order, mathematically speaking, which has value by itself. With NaN
>   counts, it was still undefined how different bit patterns of NaNs were
>   supposed to be ordered, whether NaN was allowed to have a sign bit,
>   etc., risking that different engines could come to different results
>   while filtering or sorting values within a file.
> - It also solves sort order completely. With nan_counts only, it is
>   still undefined whether NaNs should be sorted before or after all values
>   (or both, depending on the sign bit), so any file including NaNs could
>   not really leverage sort order without being ambiguous.
> - It's less complex in Thrift. Having fields that only apply to a
>   handful of data types is somewhat odd. If every type did this, we would
>   have a plethora of non-generic fields in Thrift.
> - Treating NaNs so specially gives them attention they don't deserve.
>   Most data sets do not contain NaNs. If a use case really requires them
>   and needs filtering to ignore them, they can store NULL instead, or
>   encode them differently. I would prefer the average case over the
>   special case here.
> - The majority of the people discussing this so far seem to favor total
>   order.
>
> Cheers,
> Jan
>
> On Sat, Jul 26, 2025 at 5:38 PM Gang Wu <ust...@gmail.com> wrote:
>
> > Hi all,
> >
> > As this discussion has been open for more than two years, I’d like to
> > bump this thread again to give a progress update and collect feedback.
> >
> > *Background*
> > • Today Parquet’s min/max stats and page index omit NaNs entirely.
> > • Engines can’t safely prune floating-point values because they know
> >   nothing about NaNs.
> > • Column index is disabled if any page contains only NaNs.
> >
> > There are two active proposals:
> >
> > *Proposal A - IEEE754TotalOrder* (from the PR [1])
> > • Define a new ColumnOrder to include +0, -0 and all NaN bit patterns.
> > • Stats and column index store NaNs if they appear.
> > • Three PoC implementations are ready: arrow-rs [2], duckdb [3] and
> >   parquet-java [4].
> > • For more context on this approach, please refer to the discussion
> >   in [5].
> >
> > *Proposal B - add nan_count* (from a comment [6] to [1])
> > • Add `nan_count` to stats and a `nan_counts` list to column index.
> > • For all-NaN cases, write NaN to min/max and use nan_count to
> >   distinguish.
> >
> > Both solutions have pros and cons but are way better than the status quo
> > today.
> > Please share your thoughts on the two proposals above, or maybe come up
> > with better alternatives. We need to reach consensus on one proposal and
> > move forward.
> >
> > [1] https://github.com/apache/parquet-format/pull/221
> > [2] https://github.com/apache/arrow-rs/pull/7408
> > [3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
> > [4] https://github.com/apache/parquet-java/pull/3191
> > [5] https://github.com/apache/parquet-format/pull/196
> > [6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077
> >
> > Best,
> > Gang
> >
> > On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <jpfi...@gmail.com> wrote:
> >
> > > Dear contributors,
> > >
> > > My PR has now gathered comments for a week and the gist of all open
> > > issues is the question of how to encode pages/column chunks that
> > > contain only NaNs. There are different suggestions and I don't see
> > > one common favorite yet.
> > >
> > > I have outlined three alternatives for how we can handle these and I
> > > want us to reach a conclusion here, so I can update my PR accordingly
> > > and move on with it. As this is my first contribution to Parquet, I
> > > don't know the decision processes here. Do we vote? Is there a single
> > > decision maker or a group of decision makers? *Please let me know how
> > > to come to a conclusion here; what are the next steps?*
> > >
> > > For reference, here are the three alternatives I pointed out. You can
> > > find a detailed description of their pros and cons in my comment:
> > >
> > > https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
> > >
> > > 1. My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
> > > 2. Adding `num_values` to the ColumnIndex, to make it symmetric with
> > >    Statistics in pages & `ColumnMetaData` and to enable the computation
> > >    `num_values - null_count - nan_count == 0`
> > > 3. Adding a `nan_pages` bool list to the column index, which indicates
> > >    whether a page contains only NaNs
> > >
> > > Cheers
> > > Jan Finis
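
For reference, the ordering that Proposal A relies on can be sketched in a few lines. The Python below is only an illustration and is not taken from any of the linked PRs: `total_order_key` and `float_from_bits` are hypothetical helper names, and the sketch assumes 64-bit IEEE 754 doubles. It maps each double to an integer whose ordinary ordering should match the IEEE 754 totalOrder predicate, so that negative NaNs sort first, then -inf, the negative numbers, -0, +0, the positive numbers, +inf, and finally positive NaNs.

```python
import struct


def float_from_bits(bits: int) -> float:
    """Build a float64 from a raw 64-bit pattern (hypothetical helper for the demo)."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def total_order_key(x: float) -> int:
    """Return an int whose natural order follows IEEE 754 totalOrder for float64.

    Assumption: this is one common bit trick for the predicate, not the
    code used by any Parquet implementation mentioned in this thread.
    """
    # Reinterpret the 8 bytes of the double as a signed 64-bit integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Non-negative patterns already sort correctly. For patterns with the
    # sign bit set, flip the 63 magnitude bits so that larger magnitudes
    # (more negative values, and negative NaNs) sort lower.
    if bits < 0:
        bits ^= 0x7FFF_FFFF_FFFF_FFFF
    return bits


# Quiet NaNs with the sign bit clear and set, built from explicit bit patterns.
pos_nan = float_from_bits(0x7FF8_0000_0000_0000)
neg_nan = float_from_bits(0xFFF8_0000_0000_0000)

values = [pos_nan, 1.0, 0.0, -0.0, float("-inf"), neg_nan, float("inf"), -1.0]
ordered = sorted(values, key=total_order_key)
```

Because every bit pattern gets a distinct key, min/max statistics computed with such a key are well defined even when a page contains NaNs or both signed zeros, which is the property the rationale above points to.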