[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741072#comment-17741072
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625537697

   > > @mapleFU @gszadovszky @pitrou @wgtmac What is your opinion on this 
proposal?
   > 
   > It's difficult to say without understanding the implications. Say a data 
page contains some NaNs, what happens?
   
   @pitrou On the write path:
   
   * The writing library would set the `ColumnOrder` for this column to the new 
option, let's call it `IEEE754TotalOrder`.
   * The writing library would use IEEE 754 total order for all order / sorting 
related tasks. I.e., it would compute the min and max values of each page using 
that total order, which has a defined place for NaN. The writer would *not* 
need special logic for NaN; it would just order everything using the total 
order. E.g., for a page containing a positive NaN, that NaN would be chosen as 
the max value, as NaN > everything else in the total order.
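
   The write-side min/max computation could be sketched as follows (a minimal Python sketch, assuming 64-bit doubles; the bit manipulation is the standard way to realize IEEE 754 totalOrder as a plain integer comparison, and the helper name `total_order_key` is mine, not from the spec):

```python
import math
import struct

def total_order_key(x: float) -> int:
    """Map a double to an integer whose natural order is IEEE 754 totalOrder."""
    # Reinterpret the float's IEEE 754 bits as an unsigned 64-bit integer.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    if bits & (1 << 63):
        # Sign bit set (negative): flip all bits so larger magnitudes sort lower.
        return bits ^ ((1 << 64) - 1)
    # Non-negative: set the sign bit so these sort above all negatives.
    return bits | (1 << 63)

page = [1.5, float("nan"), -0.0, float("-inf"), 0.0]
ordered = sorted(page, key=total_order_key)
page_min, page_max = ordered[0], ordered[-1]
# -inf < -0.0 < +0.0 < 1.5 < NaN: the positive (quiet) NaN becomes the max.
```

   Note that total order also distinguishes -0.0 from +0.0, so the min/max bounds become fully deterministic with no writer-side special cases.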
   
   On the read path:
   * A reading engine encountering the new `IEEE754TotalOrder` column order 
would either
     a) (legacy reader) not understand it and in this case not use any 
statistics, as it doesn't understand the semantics of the order relation.
     b) (new reader) understand it and assume that everything is ordered by 
IEEE 754 total order, which again has a defined place for NaNs. Depending on 
the NaN semantics of the reading engine, it would need to align the values it 
sees in min/max with its own semantics. What this alignment looks like depends 
on the semantics of the engine. (I can give more detailed examples for 
different engine semantics if necessary.)
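     
   As one hypothetical example of such an alignment (the helper name and the widening policy are illustrative, not part of this proposal): an engine whose comparisons ignore NaN entirely could conservatively widen NaN bounds before using them for pruning:

```python
import math

def align_bounds(page_min: float, page_max: float) -> tuple:
    # Under total order, a negative NaN sorts below -inf and a positive NaN
    # above +inf, so a NaN bound can appear on either side. An engine that
    # ignores NaN in comparisons can conservatively widen such a bound to
    # the corresponding infinity, keeping filter pruning sound.
    lo = float("-inf") if math.isnan(page_min) else page_min
    hi = float("inf") if math.isnan(page_max) else page_max
    return lo, hi
```

   An engine with different NaN semantics (e.g. NaN sorts first, as in some SQL dialects) would map the bounds differently, but always locally in the reader.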
     
   Ramifications:
   * PRO: Due to the new column order, legacy readers are guarded. They don't 
need to understand the new order. Even if they ignore the column order, if they 
see NaNs in min and max they will just ignore them and assume the statistics 
aren't usable. So we have two layers of protection to make sure legacy readers 
don't misunderstand the ordering.
   * PRO: No special fields for NaNs. No `nan_counts`, no `nan_pages`. Instead, 
NaNs are just treated as defined in the total ordering.
   * PRO: Simple, standardized handling of floats instead of special handling 
of NaNs. I guess this was the main point of @tustvold and @crepererum.
   * PRO: Engines only need to understand total ordering and don't need any 
special NaN handling code anymore (unless their semantics is different, in 
which case they need to map their semantics from / to total ordering).
   * CON: NaNs *will* be used in min/max bounds, even for pages that are not 
NaN-only. This makes them less effective for filtering (as they are the widest 
possible bounds), but @crepererum made a good point that this "special case 
for NaN" is quite arbitrary; we could just as well special-case INT_MAX for 
integer columns. I see the argument that keeping the architecture simple might 
be preferable. Also, NaNs are not widely used, so this will not be detrimental 
to many data sets.
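   
   To make the CON concrete, here is a small sketch (the pruning helper is hypothetical, not an existing API): once a page contains a single positive NaN, its total-order max is NaN, so a range predicate can no longer rule the page out:

```python
import math

def can_skip_page(page_min: float, page_max: float,
                  lo: float, hi: float) -> bool:
    # Skip a page for the predicate lo <= x <= hi only if its bounds
    # provably exclude the range. A NaN bound is the widest possible
    # bound under total order, so it never permits skipping.
    if math.isnan(page_min) or math.isnan(page_max):
        return False
    return page_max < lo or page_min > hi

# Page holding [1.0, 2.0]: the predicate 10 <= x <= 20 can skip it.
# Add one positive NaN to the page: the total-order max becomes NaN
# and the same predicate can no longer skip it.
```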
    
     
     




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> -----------------------------------------------------------------------
>
>                 Key: PARQUET-2249
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2249
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Jan Finis
>            Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are 
> to be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect, but I guess solution 3 is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation, where engines 
> writing such a page cannot write a conforming parquet file and will 
> arbitrarily pick one of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
