alamb commented on code in PR #183:
URL: https://github.com/apache/parquet-site/pull/183#discussion_r3326600464


##########
content/en/blog/features/ieee754-order.md:
##########
@@ -0,0 +1,88 @@
+---
+title: "Taming Floating-Point Statistics in Apache Parquet: IEEE 754 Total 
Order and NaN Counts"
+date: 2026-05-29
+description: "How Apache Parquet resolves ambiguous floating-point statistics 
using IEEE 754 total order and explicit NaN counts for better query 
performance."

Review Comment:
   ```suggestion
   description: "How the Apache Parquet Community resolved potentially 
ambiguous floating-point statistics using IEEE 754 total order and explicit NaN 
counts"
   ```



##########
content/en/blog/features/ieee754-order.md:
##########
@@ -0,0 +1,88 @@
+---
+title: "Taming Floating-Point Statistics in Apache Parquet: IEEE 754 Total 
Order and NaN Counts"
+date: 2026-05-29
+description: "How Apache Parquet resolves ambiguous floating-point statistics 
using IEEE 754 total order and explicit NaN counts for better query 
performance."
+author: "[Jan Finis](https://github.com/JFinis), [Ed 
Seidl](https://github.com/etseidl), [Gang Wu](https://github.com/wgtmac)"
+categories: ["features"]
+---
+
+Column statistics are the secret to Apache Parquet's blazing fast performance. 
By storing compact summaries—like `min`, `max`, and null counts—for row groups, 
column chunks, and pages, readers can easily skip irrelevant data that doesn't 
match a query.
+
+However, floating-point values throw a wrench into this simple model. The IEEE 
754 standard defines special values like `NaN` (Not a Number), signed zeros 
(`-0.0` and `+0.0`), and infinities. Their comparison rules don't play well 
with the simple "total order" (a strict smaller-to-larger ranking) expected by 
most data-pruning algorithms. To fix this, the Parquet community recently 
clarified the standard by combining IEEE 754 total order semantics with an 
explicit `nan_count` field in the statistics.
+
+The result is a much clearer contract between data writers and readers. 
Floating-point bounds can now be interpreted consistently, and readers can 
confidently determine if `NaN` values are present, without having to guess 
based solely on `min` and `max` bounds.
+
+## Why Floating-Point Statistics Need Special Handling
+
+For integers, strings, and many other straightforward types, Parquet 
statistics are simple: the writer records the absolute smallest and largest 
values, and the reader uses those bounds to decide if a query might find a 
match.
+
+Floating-point columns are trickier for two major reasons. First, `-0.0` and 
`+0.0` are considered equal in normal math operations, yet they possess 
distinct underlying bit patterns. A data format needs strict rules on how to 
order these values; otherwise, different libraries might generate conflicting 
statistics for the exact same underlying data.
+
+Second, `NaN` is completely unordered under standard IEEE 754 comparisons. 
Expressions like `x < NaN`, `x > NaN`, and `x == NaN` always evaluate to false. 
If a writer blindly includes `NaN` in ordinary `min` or `max` calculations, the 
resulting bounds might be useless for skipping data. Conversely, if a writer 
simply ignores `NaN` values, readers are left in the dark about whether any 
`NaN`s actually exist in the data block.

Review Comment:
   As above, I think this would be clearer with a motivating example.
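
A motivating example could be as small as the sketch below (illustrative only, not taken from the PR). It uses plain Python floats to show the two problems the paragraph describes: signed zeros that compare equal but differ in their bits, and `NaN` making a naive `min` depend on input order. The helper name `bits` is made up for this sketch.

```python
import math
import struct

nan = float("nan")

# Every ordinary comparison involving NaN is false, including NaN == NaN.
comparisons = [1.0 < nan, 1.0 > nan, 1.0 == nan, nan == nan]

def bits(x: float) -> str:
    """Hex of the big-endian IEEE 754 double encoding (hypothetical helper)."""
    return struct.pack(">d", x).hex()

# -0.0 and +0.0 compare equal but are stored with different bit patterns.
zeros_equal = (-0.0 == 0.0)
neg_zero_bits = bits(-0.0)
pos_zero_bits = bits(0.0)

# A naive min based on "<" gives different answers for the same values
# depending on where the NaN happens to sit.
min_a = min([nan, 1.0, 2.0])
min_b = min([1.0, nan, 2.0])
min_a_is_nan = math.isnan(min_a)

print(comparisons, zeros_equal, neg_zero_bits, pos_zero_bits)
print(min_a, min_b)
```

Two writers that scan the same values in a different order, or treat `NaN` differently, can therefore record different `min` values for identical data.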



##########
content/en/blog/features/ieee754-order.md:
##########
@@ -0,0 +1,88 @@
+---
+title: "Taming Floating-Point Statistics in Apache Parquet: IEEE 754 Total 
Order and NaN Counts"
+date: 2026-05-29
+description: "How Apache Parquet resolves ambiguous floating-point statistics 
using IEEE 754 total order and explicit NaN counts for better query 
performance."
+author: "[Jan Finis](https://github.com/JFinis), [Ed 
Seidl](https://github.com/etseidl), [Gang Wu](https://github.com/wgtmac)"
+categories: ["features"]
+---
+
+Column statistics are the secret to Apache Parquet's blazing fast performance. 
By storing compact summaries—like `min`, `max`, and null counts—for row groups, 
column chunks, and pages, readers can easily skip irrelevant data that doesn't 
match a query.
+
+However, floating-point values throw a wrench into this simple model. The IEEE 
754 standard defines special values like `NaN` (Not a Number), signed zeros 
(`-0.0` and `+0.0`), and infinities. Their comparison rules don't play well 
with the simple "total order" (a strict smaller-to-larger ranking) expected by 
most data-pruning algorithms. To fix this, the Parquet community recently 
clarified the standard by combining IEEE 754 total order semantics with an 
explicit `nan_count` field in the statistics.
+
+The result is a much clearer contract between data writers and readers. 
Floating-point bounds can now be interpreted consistently, and readers can 
confidently determine if `NaN` values are present, without having to guess 
based solely on `min` and `max` bounds.
+
+## Why Floating-Point Statistics Need Special Handling
+
+For integers, strings, and many other straightforward types, Parquet 
statistics are simple: the writer records the absolute smallest and largest 
values, and the reader uses those bounds to decide if a query might find a 
match.

Review Comment:
   I think this section would be stronger with a specific example, such as 
two floating-point columns, one with NaNs and one without, and a predicate 
like `where x > 1.0`.
   
   My understanding is that for the column without NaNs, the statistics can 
prove that the predicate matches *all rows*, so the predicate can be skipped 
during execution.
   
   However, the column *with* a NaN does not match all rows.
   
   Something like this
   
   ```
   100.0
   200.0
   NaN    <-- needs to be filtered out
   300.0
   ```
   
   Previously, most Parquet writers would write stats like
   ```
   min: 100.0
   max: 300.0
   ```
   
   And a clever engine might conclude that *all* rows match (for example, via 
the optimization described by @xudong963 in 
https://datafusion.apache.org/blog/2026/03/20/limit-pruning/), which in this 
case is incorrect.
   
   The engine needs to know whether any NaNs appear in the data.
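
To make the scenario concrete, here is a rough Python sketch of it. Everything in it is an assumption for illustration: `FloatStats`, `compute_stats`, `naive_all_rows_match_gt`, and `all_rows_match_gt` are hypothetical names, not part of any Parquet or DataFusion API, and the writer is assumed to skip NaN when computing `min` and `max`, as in the stats above. Only the `nan_count` field name comes from the post.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FloatStats:
    """Hypothetical per-chunk statistics, not a real Parquet structure."""
    min: float
    max: float
    nan_count: Optional[int]  # None models a file written without a NaN count

def compute_stats(values: List[float], with_nan_count: bool) -> FloatStats:
    # Assumption: the writer ignores NaN for min/max, and the chunk
    # contains at least one non-NaN value.
    non_nan = [v for v in values if not math.isnan(v)]
    nan_count = len(values) - len(non_nan)
    return FloatStats(min(non_nan), max(non_nan),
                      nan_count if with_nan_count else None)

def naive_all_rows_match_gt(stats: FloatStats, threshold: float) -> bool:
    # Trusts min alone, which is unsafe if NaN rows may be present.
    return stats.min > threshold

def all_rows_match_gt(stats: FloatStats, threshold: float) -> bool:
    # "NaN > threshold" is false, so any NaN row fails the predicate.
    # An unknown NaN count means nothing can be proven.
    if stats.nan_count is None or stats.nan_count > 0:
        return False
    return stats.min > threshold

values = [100.0, 200.0, float("nan"), 300.0]
actual_matches = sum(1 for v in values if v > 1.0)

old_stats = compute_stats(values, with_nan_count=False)
new_stats = compute_stats(values, with_nan_count=True)
clean_stats = compute_stats([100.0, 200.0, 300.0], with_nan_count=True)

naive = naive_all_rows_match_gt(old_stats, 1.0)   # claims all rows match
safe = all_rows_match_gt(new_stats, 1.0)          # refuses: a NaN is present
clean = all_rows_match_gt(clean_stats, 1.0)       # all rows provably match

print(actual_matches, naive, safe, clean)
```

With only `min: 100.0` and `max: 300.0`, the naive check reports that all four rows match even though only three do. With a NaN count, the same check can be made safe, and it still succeeds for the column without NaNs.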
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
