asfimport opened a new issue, #406:
URL: https://github.com/apache/parquet-format/issues/406

   Currently, the specification of `ColumnIndex` in `parquet.thrift` is 
inconsistent, leading to cases where it is impossible to create a Parquet file 
that conforms to the spec.
   
   The problem is with double/float columns if a page contains only NaN values. 
The spec mentions that NaN values should not be included in min/max bounds, so 
a page consisting of only NaN values has no defined min/max bound. To quote the 
spec:
   
   
   ```
   
      *     When writing statistics the following rules should be followed:
      *     - NaNs should not be written to min or max statistics fields.
   ```
   
   However, the comment on the `null_pages` member of `ColumnIndex` states the 
following:
   
   
   ```
   
   struct ColumnIndex {
     /**
      * A list of Boolean values to determine the validity of the corresponding
      * min and max values. If true, a page contains only null values, and writers
      * have to set the corresponding entries in min_values and max_values to
      * byte[0], so that all lists have the same length. If false, the
      * corresponding entries in min_values and max_values must be valid.
      */
     1: required list<bool> null_pages
   ```
   
   For a page with only NaNs, we now have a problem. The page definitely does 
**not** contain only null values, so `null_pages` should be `false` for this 
page. However, in this case the spec requires valid min/max values in 
`min_values` and `max_values` for this page. As the only value in the page is 
NaN, the only valid min/max value we could enter here is NaN, but as mentioned 
before, NaNs should never be written to min/max values.
   
   Thus, no writer can currently create a Parquet file that conforms to this 
specification as soon as there is an only-NaN page and column indexes are to 
be written.
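
The conflict can be reproduced in a few lines. The following is a minimal Python sketch, not code from any Parquet implementation: `page_stats` is a hypothetical helper that applies the two quoted rules literally to one page, with `None` standing in for null and `None` bounds meaning "no value is eligible".

```python
import math

def page_stats(values):
    """Apply the two quoted rules literally to one page.

    `values` is a list of floats, with None standing in for null.
    Returns (null_page, min_value, max_value); min/max are None when
    no value is eligible to be written as a bound.
    """
    non_null = [v for v in values if v is not None]
    # ColumnIndex rule: null_pages is true only if the page holds only nulls.
    null_page = len(non_null) == 0
    # Statistics rule: NaNs must not be written to min or max.
    candidates = [v for v in non_null if not math.isnan(v)]
    if candidates:
        return null_page, min(candidates), max(candidates)
    return null_page, None, None

print(page_stats([1.0, float("nan"), 3.0]))  # (False, 1.0, 3.0)
print(page_stats([None, None]))              # (True, None, None)
print(page_stats([float("nan"), None]))      # (False, None, None)
```

The last call is the problematic case: `null_pages` must be `false`, yet there is no valid value to put into `min_values` and `max_values`.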
   
   I see three possible solutions:
   1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
`null_pages` entry set to **true**.
   2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
`byte[0]` as min/max, even though the `null_pages` entry is set to **false**.
   3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
NaN as min & max in the column index.
   
   None of the solutions is perfect. But I guess solution 3 is the best of 
them. It gives us valid min/max bounds, makes `null_pages` compatible with this, 
and gives us a way to identify only-NaN pages (min=max=NaN).
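
As an illustration of solution 3, here is a minimal Python sketch under stated assumptions: the column is a DOUBLE whose bounds are stored as 8-byte little-endian IEEE 754 values, and `column_index_entry` and `is_only_nan_page` are hypothetical helper names, not part of any Parquet library.

```python
import math
import struct

def column_index_entry(values):
    """Hypothetical writer-side helper following solution 3.

    Returns (null_page, min_bytes, max_bytes) for one page, where
    None in `values` stands in for null.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Only nulls: null_pages is true and both bounds are byte[0].
        return True, b"", b""
    candidates = [v for v in non_null if not math.isnan(v)]
    if not candidates:
        # Only NaNs (possibly mixed with nulls): write NaN as min and max.
        nan = struct.pack("<d", float("nan"))
        return False, nan, nan
    # Otherwise NaNs are ignored when computing the bounds.
    return (False,
            struct.pack("<d", min(candidates)),
            struct.pack("<d", max(candidates)))

def is_only_nan_page(null_page, min_bytes, max_bytes):
    """Hypothetical reader-side check for min=max=NaN."""
    if null_page:
        return False
    lo = struct.unpack("<d", min_bytes)[0]
    hi = struct.unpack("<d", max_bytes)[0]
    return math.isnan(lo) and math.isnan(hi)
```

The reader-side check has to use `math.isnan` rather than `==`, because NaN compares unequal to itself.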
   
   As a general note: I would say that it is a shortcoming that Parquet doesn't 
track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
have this inconsistency. In a future version, NaN counts could be introduced, 
but that doesn't help for backward compatibility, so we do need a solution for 
now.
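
To show why a NaN count would remove the ambiguity, here is a small Python sketch. The `nan_count` field is hypothetical and does not exist in the format described in this issue; `PageCounts`, `count_page`, and `has_valid_bounds` are illustrative names only.

```python
import math
from dataclasses import dataclass

@dataclass
class PageCounts:
    num_values: int
    null_count: int
    nan_count: int  # hypothetical field, not part of the current spec

def count_page(values):
    """Count nulls (None) and NaNs in one page of float values."""
    nulls = sum(1 for v in values if v is None)
    nans = sum(1 for v in values if v is not None and math.isnan(v))
    return PageCounts(len(values), nulls, nans)

def has_valid_bounds(counts):
    """True if at least one value is neither null nor NaN."""
    return counts.num_values - counts.null_count - counts.nan_count > 0
```

With such counts, a reader could tell an only-NaN page apart from an only-null page without relying on the contents of the min/max fields.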
   
   Any of the solutions is better than the current situation where engines 
writing such a page cannot write a conforming parquet file and will randomly 
pick any of the solutions.
   
   Thus, my suggestion would be to update parquet.thrift to use solution 3. 
I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
bounds by adding a clause stating that "if a page contains only NaNs or a 
mixture of NaNs and NULLs, then NaN should be written as min & max".
   
    
   
   **Reporter**: [Jan 
Finis](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=jfinis) / 
@jfinis
   #### PRs and other links:
   - [GitHub Pull Request 
#221](https://github.com/apache/parquet-format/pull/221)
   
   <sub>**Note**: *This issue was originally created as 
[PARQUET-2249](https://issues.apache.org/jira/browse/PARQUET-2249). Please see 
the [migration 
documentation](https://issues.apache.org/jira/browse/PARQUET-2502) for further 
details.*</sub>



