[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739025#comment-17739025 ]
ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1614476748

> I wonder for PageIndex pruning in Rust implementions

Currently the arrow-rs implementation uses the totalOrder predicate, as defined by the IEEE 754 (2008 revision) floating point standard, to order floats. This can be implemented very efficiently with some bit-twiddling, and it appears to be the standardised way to handle this. I believe DataFusion uses these same comparison kernels for evaluating pruning predicates, so I would expect it to behave the same way with regard to NaNs.

From the [Rust docs](https://doc.rust-lang.org/std/primitive.f32.html#method.total_cmp):

> The values are ordered in the following sequence:
>
> negative quiet NaN
> negative signaling NaN
> negative infinity
> negative numbers
> negative subnormal numbers
> negative zero
> positive zero
> positive subnormal numbers
> positive numbers
> positive infinity
> positive signaling NaN
> positive quiet NaN

> would it matter for adding [-inf, +inf] as min-max for all nan and null pages

I haven't read the full backscroll, but the original PR's suggestion of simply writing NaN for a page containing only NaNs seems perfectly logical to me, unlikely to cause compatibility issues, and significantly less surprising than writing a value that doesn't actually appear in the data...

> Let's cc some of the maintainers of [parquet-rs](https://github.com/apache/arrow-rs/tree/master/parquet):

I don't really know enough about the history of floating point comparison to weigh in on the best solution with any degree of authority, but my 2 cents is that the totalOrder predicate is the standardised way to handle this. While I do agree that the behaviour of aggregate statistics containing NaNs might be unfortunate for some workloads, I'm not sure that special-casing them is beneficial.
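To illustrate the bit-twiddling mentioned above, here is a minimal sketch of how a totalOrder comparison for `f32` can be done on the raw bits. It mirrors the trick used by the standard library's `f32::total_cmp`; it is not the actual arrow-rs source, just an illustration of the technique:

```rust
use std::cmp::Ordering;

// Sketch of IEEE 754 totalOrder for f32 via bit manipulation.
fn total_order_f32(a: f32, b: f32) -> Ordering {
    // Reinterpret the raw IEEE 754 bits as signed integers.
    let mut ia = a.to_bits() as i32;
    let mut ib = b.to_bits() as i32;
    // Floats use sign-magnitude encoding, so negative values sort backwards
    // when compared as integers. Flipping the magnitude bits of negative
    // values yields a monotonic two's-complement ordering that matches
    // totalOrder, including the NaN placement quoted above.
    ia ^= (((ia >> 31) as u32) >> 1) as i32;
    ib ^= (((ib >> 31) as u32) >> 1) as i32;
    ia.cmp(&ib)
}

fn main() {
    // NaNs sort outside the infinities under totalOrder.
    assert_eq!(total_order_f32(f32::NAN, f32::INFINITY), Ordering::Greater);
    assert_eq!(total_order_f32(-f32::NAN, f32::NEG_INFINITY), Ordering::Less);
    // -0.0 sorts strictly before +0.0.
    assert_eq!(total_order_f32(-0.0, 0.0), Ordering::Less);
    // Agrees with the standard library's total_cmp.
    assert_eq!(total_order_f32(1.5, -2.5), 1.5f32.total_cmp(&-2.5));
}
```

Two integer comparisons of pre-transformed keys are branch-free and cheap, which is why this ordering is attractive for statistics and pruning kernels.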
Aside from the non-trivial additional complexity of special-casing them, if you don't include NaNs in statistics it is unclear to me how you can push down a comparison predicate, as you have no way to know whether the page contains NaNs. Perhaps that is what this PR seeks to address, but I do wonder if the simple solution might be worth considering...

Also tagging @crepererum, who may have further thoughts.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> -----------------------------------------------------------------------
>
>                 Key: PARQUET-2249
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2249
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Jan Finis
>            Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is inconsistent, leading to cases where it is impossible to create a parquet file that conforms to the spec.
> The problem is with double/float columns if a page contains only NaN values. The spec mentions that NaN values should not be included in min/max bounds, so a page consisting of only NaN values has no defined min/max bound. To quote the spec:
> {noformat}
>  * When writing statistics the following rules should be followed:
>  * - NaNs should not be written to min or max statistics fields.{noformat}
> However, the comment on the null_pages member of ColumnIndex states the following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem.
> The page definitely does *not* contain only null values, so {{null_pages}} should be {{false}} for this page. However, in this case the spec requires valid min/max values in {{min_values}} and {{max_values}} for this page. As the only value in the page is NaN, the only valid min/max value we could enter here is NaN, but as mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this specification as soon as there is an only-NaN column and column indexes are to be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its null_pages entry set to *true*.
> 2. A page consisting only of NaNs (or a mixture of NaNs and nulls) has {{byte[0]}} as min/max, even though the null_pages entry is set to *false*.
> 3. A page consisting only of NaNs (or a mixture of NaNs and nulls) has NaN as min & max in the column index.
> None of the solutions is perfect, but I guess solution 3 is the best of them. It gives us valid min/max bounds, makes null_pages consistent with this, and gives us a way to detect only-NaN pages (min=max=NaN).
> As a general note: I would say it is a shortcoming that Parquet doesn't track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't have this inconsistency. NaN counts could be introduced in a future version, but that doesn't help with backward compatibility, so we do need a solution for now.
> Any of the solutions is better than the current situation, where engines writing such a page cannot produce a conforming parquet file and will arbitrarily pick one of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3.
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max bounds by adding a clause stating that "if a page contains only NaNs or a mixture of NaNs and NULLs, then NaN should be written as min & max".

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
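To make solution 3 from the issue above concrete, here is a hypothetical writer-side sketch of how a per-page column-index entry could be computed. The function name, signature, and `Option<f64>`-as-nullable-value representation are my own for illustration, not code from any real Parquet writer:

```rust
/// Compute a column-index entry (null_pages flag, min, max) for one page
/// of a nullable f64 column, following solution 3: a page of only NaNs
/// (or NaNs mixed with nulls) gets NaN as both min and max.
fn page_index_entry(page: &[Option<f64>]) -> (bool, Option<f64>, Option<f64>) {
    // Non-null values with NaNs filtered out, per the existing spec rule
    // that NaNs are excluded from min/max statistics.
    let non_nan: Vec<f64> = page
        .iter()
        .filter_map(|v| *v)
        .filter(|v| !v.is_nan())
        .collect();
    if page.iter().all(|v| v.is_none()) {
        // Only nulls: null_pages = true, min/max stay empty (byte[0]).
        (true, None, None)
    } else if non_nan.is_empty() {
        // Only NaNs, or NaNs mixed with nulls: solution 3 writes NaN as
        // both min and max, making min=max=NaN the only-NaN marker.
        (false, Some(f64::NAN), Some(f64::NAN))
    } else {
        // Ordinary page: min/max over the non-NaN values.
        let min = non_nan.iter().copied().fold(f64::INFINITY, f64::min);
        let max = non_nan.iter().copied().fold(f64::NEG_INFINITY, f64::max);
        (false, Some(min), Some(max))
    }
}

fn main() {
    // Null-only page.
    assert_eq!(page_index_entry(&[None, None]), (true, None, None));
    // NaN-only page: not a null page, min=max=NaN.
    let (null_page, min, max) = page_index_entry(&[Some(f64::NAN), None]);
    assert!(!null_page && min.unwrap().is_nan() && max.unwrap().is_nan());
    // Ordinary page: NaNs are skipped when computing the bounds.
    assert_eq!(
        page_index_entry(&[Some(3.0), Some(f64::NAN), Some(1.0)]),
        (false, Some(1.0), Some(3.0))
    );
}
```

Note how the three branches line up with the three cases the issue distinguishes, and how `min=max=NaN` doubles as the detector for only-NaN pages that the issue mentions.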