[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704107#comment-17704107
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146076533


##########
README.md:
##########
@@ -163,18 +163,25 @@ following rules:
       [Thrift definition](src/main/thrift/parquet.thrift) in the
       `ColumnOrder` union. They are summarized here but the Thrift definition
       is considered authoritative:
-      * NaNs should not be written to min or max statistics fields.
-      * If the computed max value is zero (whether negative or positive),
-        `+0.0` should be written into the max statistics field.
-      * If the computed min value is zero (whether negative or positive),
-        `-0.0` should be written into the min statistics field.
-
-      For backwards compatibility when reading files:
-      * If the min is a NaN, it should be ignored.
-      * If the max is a NaN, it should be ignored.
-      * If the min is +0, the row group may contain -0 values as well.
-      * If the max is -0, the row group may contain +0 values as well.
-      * When looking for NaN values, min and max should be ignored.
+      * The following compatibility rules should be applied when reading 
statistics:
+        * If the nan_count field is set to > 0 and both min and max are

Review Comment:
   @mapleFU To your general comment (I can't answer there)
   
   > The skeleton LGTM. But I wonder why if it has min/max/nan_count, it can 
decide nan by min-max. Can we just decide it by `null_count + nan_count == 
num_values`?
   
    The problem is that the ColumnIndex does not have the `num_values` field, 
so using this computation to derive whether there are only NaNs would only be 
applicable to Statistics, not to the column index. Of course, we could do what 
I suggested in alternatives and give the column index a `num_values` list. Then 
this would indeed work everywhere but at the cost of an additional list.
   
   So I see we have the following options:
   * Do what I did here, i.e., use min/max to determine whether there are only 
NaNs
   * Add a `num_values` list to the ColumnIndex
   * Accept the fact that the column index cannot detect only-NaN columns
   * Tell readers to use the `min==max==NaN` reasoning only in the column 
index, and use the `null_count + nan_count == num_values` for the statistics.
   
   Which one would you suggest here?





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> -----------------------------------------------------------------------
>
>                 Key: PARQUET-2249
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2249
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Jan Finis
>            Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to