[GitHub] [parquet-format] JFinis commented on a diff in pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

via GitHub Thu, 23 Mar 2023 03:58:46 -0700


JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146008777



##########
src/main/thrift/parquet.thrift:
##########
@@ -886,16 +888,23 @@ union ColumnOrder {
    *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
    *
    * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
-   *     compatibility rules should be applied when reading statistics:
-   *     - If the min is a NaN, it should be ignored.
-   *     - If the max is a NaN, it should be ignored.
+   *     point values (relations vs. total ordering), the following 
compatibility
+   *     rules should be applied when reading statistics:
+   *     - If the nan_count field is set to > 0 and both min and max are
+   *       NaN, a reader can rely on that all non-NULL values are NaN
+   *     - Otherwise, if the min or the max is a NaN, it should be ignored.
+   *     - When looking for NaN values, min and max should be ignored;
+   *       if the nan_count field is set, it can be used to check whether
+   *       NaNs are present.
    *     - If the min is +0, the row group may contain -0 values as well.
    *     - If the max is -0, the row group may contain +0 values as well.
-   *     - When looking for NaN values, min and max should be ignored.
    * 
    *     When writing statistics the following rules should be followed:
-   *     - NaNs should not be written to min or max statistics fields.
+   *     - The nan_count fields should always be set for FLOAT and DOUBLE 
columns.

Review Comment:
   As it is written, it would rather be a suggestion than a mandate. Note that 
I chose the word "should", i.e., they should do it to make the stats more 
viable for filtering (that's what my commit is about). They *can* opt to not do 
it (especially if they are a legacy writer that doesn't know about the field, 
yet), but then the statistics will be less useful. I guess compelling a writer 
to write statistics in a way that they are most useful is a good idea.
   
   Is the misunderstanding here wheter it is mandated or just suggested, or 
would you also not suggest it at all?



##########
src/main/thrift/parquet.thrift:
##########
@@ -886,16 +888,23 @@ union ColumnOrder {
    *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
    *
    * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
-   *     compatibility rules should be applied when reading statistics:
-   *     - If the min is a NaN, it should be ignored.
-   *     - If the max is a NaN, it should be ignored.
+   *     point values (relations vs. total ordering), the following 
compatibility
+   *     rules should be applied when reading statistics:
+   *     - If the nan_count field is set to > 0 and both min and max are
+   *       NaN, a reader can rely on that all non-NULL values are NaN
+   *     - Otherwise, if the min or the max is a NaN, it should be ignored.
+   *     - When looking for NaN values, min and max should be ignored;
+   *       if the nan_count field is set, it can be used to check whether
+   *       NaNs are present.
    *     - If the min is +0, the row group may contain -0 values as well.
    *     - If the max is -0, the row group may contain +0 values as well.
-   *     - When looking for NaN values, min and max should be ignored.
    * 
    *     When writing statistics the following rules should be followed:
-   *     - NaNs should not be written to min or max statistics fields.
+   *     - The nan_count fields should always be set for FLOAT and DOUBLE 
columns.

Review Comment:
   As it is written, it would rather be a suggestion than a mandate. Note that 
I chose the word "should", i.e., they should do it to make the stats more 
viable for filtering (that's what my commit is about). They *can* opt to not do 
it (especially if they are a legacy writer that doesn't know about the field, 
yet), but then the statistics will be less useful. I guess compelling a writer 
to write statistics in a way that they are most useful is a good idea.
   
   Is the misunderstanding here whether it is mandated or just suggested, or 
would you also not suggest it at all?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [parquet-format] JFinis commented on a diff in pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

Reply via email to