emkornfield commented on code in PR #514:
URL: https://github.com/apache/parquet-format/pull/514#discussion_r2968476325


##########
src/main/thrift/parquet.thrift:
##########
@@ -1111,23 +1122,99 @@ union ColumnOrder {
    *      64-bit signed integer (nanos)
    *    See https://github.com/apache/parquet-format/issues/502 for more 
details
    *
-   * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
+   * (*) Because TYPE_ORDER is ambiguous for floating point types due to
+   *     underspecified handling of NaN and -0/+0, it is recommended that 
writers
+   *     use IEEE_754_TOTAL_ORDER for these types.
+   *
+   *     If TYPE_ORDER is used for floating point types, then the following
    *     compatibility rules should be applied when reading statistics:
    *     - If the min is a NaN, it should be ignored.
    *     - If the max is a NaN, it should be ignored.
+   *     - If the nan_count field is set, a reader can compute
+   *       nan_count + null_count == num_values to deduce whether all non-null
+   *       values are NaN.
    *     - If the min is +0, the row group may contain -0 values as well.
    *     - If the max is -0, the row group may contain +0 values as well.
    *     - When looking for NaN values, min and max should be ignored.
+   *       If the nan_count field is set, it can be used to check whether
+   *       NaNs are present.
    * 
    *     When writing statistics the following rules should be followed:
-   *     - NaNs should not be written to min or max statistics fields.
+   *     - Always set the nan_count field for floating point types, especially
+   *       even if it is zero.
+   *     - NaNs should not be written to min or max statistics fields except
+   *       in the column index when a page contains only NaN values. In this
+   *       case, since min_values and max_values are required, a NaN value
+   *       must be written.
    *     - If the computed max value is zero (whether negative or positive),
    *       `+0.0` should be written into the max statistics field.
    *     - If the computed min value is zero (whether negative or positive),
    *       `-0.0` should be written into the min statistics field.
    */
   1: TypeDefinedOrder TYPE_ORDER;
+
+  /*
+   * The floating point type is ordered according to the totalOrder predicate,
+   * as defined in section 5.10 of IEEE-754 (2008 revision). Only columns of
+   * physical type FLOAT or DOUBLE, or logical type FLOAT16 may use this 
ordering.
+   *
+   * Intuitively, this orders floats mathematically, but defines -0 to be less
+   * than +0, -NaN to be less than anything else, and +NaN to be greater than
+   * anything else. It also defines an order between different bit 
representations
+   * of the same value.
+   *
+   * The formal definition is as follows:
+   *   a) If x<y, totalOrder(x, y) is true.
+   *   b) If x>y, totalOrder(x, y) is false.
+   *   c) If x=y:
+   *     1) totalOrder(−0, +0) is true.
+   *     2) totalOrder(+0, −0) is false.
+   *     3) If x and y represent the same floating-point datum:
+   *        i) If x and y have negative sign, totalOrder(x, y) is true if and
+   *           only if the exponent of x ≥ the exponent of y
+   *       ii) otherwise totalOrder(x, y) is true if and only if the exponent
+   *           of x ≤ the exponent of y.
+   *   d) If x and y are unordered numerically because x or y is NaN:
+   *     1) totalOrder(−NaN, y) is true where −NaN represents a NaN with
+   *        negative sign bit and y is a non-NaN floating-point number.
+   *     2) totalOrder(x, +NaN) is true where +NaN represents a NaN with
+   *        positive sign bit and x is a non-NaN floating-point number.
+   *     3) If x and y are both NaNs, then totalOrder reflects a total ordering
+   *        based on:
+   *         i) negative sign orders below positive sign
+   *        ii) signaling orders below quiet for +NaN, reverse for −NaN
+   *       iii) lesser payload, when regarded as an integer, orders below
+   *            greater payload for +NaN, reverse for −NaN.
+   *
+   * Note that this ordering can be implemented efficiently in software by 
bit-wise
+   * operations on the integer representation of the floating point values.
+   * E.g., this is a possible implementation for DOUBLE in Rust:
+   *
+   *   pub fn totalOrder(x: f64, y: f64) -> bool {
+   *     let mut x_int = x.to_bits() as i64;
+   *     let mut y_int = y.to_bits() as i64;
+   *     x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
+   *     y_int ^= (((y_int >> 63) as u64) >> 1) as i64;
+   *     return x_int <= y_int;
+   *   }
+   *
+   * When writing statistics for columns with this order, the following rules
+   * must be followed:
+   * - Writing the nan_count field is mandatory when using this ordering.
+   * - Min and max statistics must contain the smallest and largest non-NaN
+   *   values respectively, or if all non-null values are NaN, the smallest and
+   *   largest NaN values as defined by IEEE 754 total order.
+   *
+   * When reading statistics for columns with this order, the following rules
+   * should be followed:
+   * - Readers should consult the nan_count field to determine whether NaNs
+   *   are present.
+   * - A reader can compute nan_count + null_count == num_values to deduce

Review Comment:
   while not inconsistent, we should probably give guidance on whether NaN 
should be written to optional statistics fields (and whether readers should 
make use of this fact).  I wonder if we say writers shouldn't populate NaN for 
optional stats fields and always use nan_count to make this determination?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to