alamb commented on code in PR #183: URL: https://github.com/apache/parquet-site/pull/183#discussion_r3326600464
########## content/en/blog/features/ieee754-order.md: ########## @@ -0,0 +1,88 @@ +--- +title: "Taming Floating-Point Statistics in Apache Parquet: IEEE 754 Total Order and NaN Counts" +date: 2026-05-29 +description: "How Apache Parquet resolves ambiguous floating-point statistics using IEEE 754 total order and explicit NaN counts for better query performance." Review Comment: ```suggestion description: "How the Apache Parquet Community resolved potentially ambiguous floating-point statistics using IEEE 754 total order and explicit NaN counts" ``` ########## content/en/blog/features/ieee754-order.md: ########## @@ -0,0 +1,88 @@ +--- +title: "Taming Floating-Point Statistics in Apache Parquet: IEEE 754 Total Order and NaN Counts" +date: 2026-05-29 +description: "How Apache Parquet resolves ambiguous floating-point statistics using IEEE 754 total order and explicit NaN counts for better query performance." +author: "[Jan Finis](https://github.com/JFinis), [Ed Seidl](https://github.com/etseidl), [Gang Wu](https://github.com/wgtmac)" +categories: ["features"] +--- + +Column statistics are the secret to Apache Parquet's blazing fast performance. By storing compact summaries—like `min`, `max`, and null counts—for row groups, column chunks, and pages, readers can easily skip irrelevant data that doesn't match a query. + +However, floating-point values throw a wrench into this simple model. The IEEE 754 standard defines special values like `NaN` (Not a Number), signed zeros (`-0.0` and `+0.0`), and infinities. Their comparison rules don't play well with the simple "total order" (a strict smaller-to-larger ranking) expected by most data-pruning algorithms. To fix this, the Parquet community recently clarified the standard by combining IEEE 754 total order semantics with an explicit `nan_count` field in the statistics. + +The result is a much clearer contract between data writers and readers. 
Floating-point bounds can now be interpreted consistently, and readers can confidently determine if `NaN` values are present, without having to guess based solely on `min` and `max` bounds. + +## Why Floating-Point Statistics Need Special Handling + +For integers, strings, and many other straightforward types, Parquet statistics are simple: the writer records the absolute smallest and largest values, and the reader uses those bounds to decide if a query might find a match. + +Floating-point columns are trickier for two major reasons. First, `-0.0` and `+0.0` are considered equal in normal math operations, yet they possess distinct underlying bit patterns. A data format needs strict rules on how to order these values; otherwise, different libraries might generate conflicting statistics for the exact same underlying data. + +Second, `NaN` is completely unordered under standard IEEE 754 comparisons. Expressions like `x < NaN`, `x > NaN`, and `x == NaN` always evaluate to false. If a writer blindly includes `NaN` in ordinary `min` or `max` calculations, the resulting bounds might be useless for skipping data. Conversely, if a writer simply ignores `NaN` values, readers are left in the dark about whether any `NaN`s actually exist in the data block. Review Comment: As above, I think this would be clearer with a motivating example. ########## content/en/blog/features/ieee754-order.md: ########## @@ -0,0 +1,88 @@ +--- +title: "Taming Floating-Point Statistics in Apache Parquet: IEEE 754 Total Order and NaN Counts" +date: 2026-05-29 +description: "How Apache Parquet resolves ambiguous floating-point statistics using IEEE 754 total order and explicit NaN counts for better query performance." +author: "[Jan Finis](https://github.com/JFinis), [Ed Seidl](https://github.com/etseidl), [Gang Wu](https://github.com/wgtmac)" +categories: ["features"] +--- + +Column statistics are the secret to Apache Parquet's blazing fast performance. 
By storing compact summaries—like `min`, `max`, and null counts—for row groups, column chunks, and pages, readers can easily skip irrelevant data that doesn't match a query. + +However, floating-point values throw a wrench into this simple model. The IEEE 754 standard defines special values like `NaN` (Not a Number), signed zeros (`-0.0` and `+0.0`), and infinities. Their comparison rules don't play well with the simple "total order" (a strict smaller-to-larger ranking) expected by most data-pruning algorithms. To fix this, the Parquet community recently clarified the standard by combining IEEE 754 total order semantics with an explicit `nan_count` field in the statistics. + +The result is a much clearer contract between data writers and readers. Floating-point bounds can now be interpreted consistently, and readers can confidently determine if `NaN` values are present, without having to guess based solely on `min` and `max` bounds. + +## Why Floating-Point Statistics Need Special Handling + +For integers, strings, and many other straightforward types, Parquet statistics are simple: the writer records the absolute smallest and largest values, and the reader uses those bounds to decide if a query might find a match.

Review Comment:
I think this section would be stronger with a specific example: for instance, two floating-point columns, one with `NaN`s and one without, and a predicate like `where x > 1.0`.

My understanding is that the column without `NaN`s can be proven to match *all rows*, so the predicate can be skipped during execution. However, the column *with* a `NaN` doesn't match all rows.

Something like this:

```
100.0
200.0
NaN    <-- needs to be filtered out
300.0
```

Previously, most Parquet writers would write stats like

```
min: 100.0
max: 300.0
```

and a clever engine might conclude that *all* rows match (for example, via the optimization described by @xudong963 in https://datafusion.apache.org/blog/2026/03/20/limit-pruning/), which in this case is incorrect.

The engine needs to know whether any `NaN`s appear in the data.
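
The scenario in the review comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `naive_stats` and `all_rows_match_gt` are hypothetical helper names, not part of any Parquet library, and the dictionary merely imitates `min`, `max`, and the `nan_count` field that the blog post describes. The writer here is assumed to skip `NaN` when computing bounds.

```python
import math


def naive_stats(values):
    """Hypothetical writer: min/max over non-NaN values, plus a NaN count."""
    non_nan = [v for v in values if not math.isnan(v)]
    return {
        "min": min(non_nan),
        "max": max(non_nan),
        "nan_count": len(values) - len(non_nan),
    }


def all_rows_match_gt(stats, threshold, use_nan_count):
    """Return True if the stats appear to prove that every row has x > threshold."""
    if stats["min"] <= threshold:
        return False
    if use_nan_count:
        # NaN > threshold is always false, so any NaN breaks the proof.
        return stats["nan_count"] == 0
    # Legacy reasoning: trust min/max alone.
    return True


column = [100.0, 200.0, float("nan"), 300.0]
stats = naive_stats(column)

# Ground truth: evaluate the predicate row by row.
actual_matches = sum(1 for v in column if v > 1.0)

legacy = all_rows_match_gt(stats, 1.0, use_nan_count=False)
nan_aware = all_rows_match_gt(stats, 1.0, use_nan_count=True)

print(stats, actual_matches, legacy, nan_aware)
```

With only `min` and `max`, the check claims that all four rows match even though the row-by-row evaluation finds three; consulting the NaN count withdraws that claim.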
