[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735855#comment-17735855 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237381221


##########
src/main/thrift/parquet.thrift:
##########
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   Yes, the number of rows in the offset index isn't enough due to repeated values.
   
   Apart from this, the suggestions seem to be going in circles a bit now. Note 
that all suggestions in this thread were already mentioned in [my earlier post 
where I laid out our possible options for the column 
index](https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762).
   
   @pitrou what you mentioned was my Option 2. I personally would prefer this 
as it feels like a useful thing to have anyway. Having said that, others 
rightfully pointed out that it would cost a few bytes even for non-float 
columns. The value could be useful for other tasks as well. For example, it 
could be used to quickly check how many nested values are in a page, and 
summing these per-page counts would give the number of nested values in a 
column chunk. This is currently a value that cannot be obtained at all through 
statistics; instead one has to decode pages and count. For example, the SQL 
query `SELECT count(*) FROM some_nested_column;` could be fully answered with 
such a value_counts field.
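
   As a rough sketch of that last point, here is how such a query could be 
answered in Java, assuming a hypothetical per-page `value_counts` list in the 
column index (this field is not part of the current spec):

   ```java
   import java.util.List;

   class ValueCountsSketch {
       /**
        * Sums hypothetical per-page value counts over all column chunks of the
        * queried column. With such a value_counts list in the column index,
        * SELECT count(*) on a nested column would need no page decoding at all.
        */
       static long totalValueCount(List<List<Long>> valueCountsPerChunk) {
           long total = 0;
           for (List<Long> chunkValueCounts : valueCountsPerChunk) { // one list per column chunk
               for (long pageValueCount : chunkValueCounts) {        // one entry per page
                   total += pageValueCount;
               }
           }
           return total;
       }
   }
   ```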
   
   @wgtmac your proposal was my Option 1 and actually my initial proposal (see 
the previous commit). Note that 
[earlier](https://github.com/apache/parquet-format/pull/196#pullrequestreview-1362171450) 
you were against writing NaNs and preferred the nan_pages approach instead:
   
   > Personally speaking, apart from adding a nan_count to the statistics, I 
would go with the option 3: adding a nan_pages bool list to the column index. I 
am not in favor of writing any NaN to min/max bounds.
   
   Is your argument that, if we now need to write the NaNs anyway, we should 
just use them instead of adding nan_pages? I do agree that this would save the 
extra field, and I personally see nothing wrong with doing this. Readers need 
to be able to detect NaN values anyway (in order to ignore them), so they 
should be able to use the same logic to determine that min=max=NaN <=> all 
values are NaN.
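
   To make that concrete, here is a minimal Java sketch of such a reader-side 
check (operating on already-decoded double bounds; this is only an 
illustration, not code from any Parquet implementation):

   ```java
   class NaNPageCheck {
       /**
        * Under the "min=max=NaN <=> all values are NaN" convention, a page that
        * is not a null page but carries NaN as both bounds contains only NaNs
        * (apart from nulls). Readers that already ignore NaN bounds can reuse
        * this test to skip such pages.
        */
       static boolean allNonNullValuesAreNaN(boolean nullPage, double min, double max) {
           return !nullPage && Double.isNaN(min) && Double.isNaN(max);
       }
   }
   ```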
   
   As mentioned in my previous post where I compared the three approaches, I am 
happy to implement any of them and I think all of them will fulfill the 
requirements. Personally, I actually like the current approach with nan_pages 
the least: it seems redundant if we have to write NaN values anyway, and I see 
no problem with using NaN values for the "all values NaN" check.
   
   I also like the option of adding a value_counts field to the column index of 
all columns. It feels like a useful, currently missing field (one that is not 
subsumed by the offset index row counts for nested columns), and I would love 
to add it as well; the few extra bytes will be so negligible compared to the 
actual data that no one will ever care. It would also let us do the "all 
values NaN" check the same way in page statistics and in the column index.
   
   So we're back at the three options I proposed:
   
   1. Drop nan_pages and use my initial approach of "min=max=NaN <=> all values 
are NaN" in the column index.
   2. Drop nan_pages and instead add value_counts, so we can use 
`value_counts - null_counts == nan_counts` to determine whether all non-null 
values in a page are NaN. (My personal favorite; see the sketch after this list.)
   3. Retain the current state and use `nan_pages`.
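
   For option 2, the arithmetic would look roughly like the following Java 
sketch (value_counts, null_counts, and nan_counts are the hypothetical per-page 
lists discussed in this thread, not fields of the current spec):

   ```java
   class OptionTwoSketch {
       /**
        * With per-page value, null, and NaN counts available, every non-null
        * value in a page is a NaN exactly when value_count - null_count equals
        * nan_count. Requiring at least one NaN excludes pages containing only
        * nulls.
        */
       static boolean onlyNaNsBesidesNulls(long valueCount, long nullCount, long nanCount) {
           return nanCount > 0 && valueCount - nullCount == nanCount;
       }
   }
   ```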
   
   @wgtmac @mapleFU @gszadovszky @pitrou could we arrive at a consensus here? 
I'm happy to adapt my PR to any of the solutions. @gszadovszky you also haven't 
mentioned your favorite yet (you just pointed out that we have to write some 
valid value).
   
   





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> -----------------------------------------------------------------------
>
>                 Key: PARQUET-2249
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2249
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Jan Finis
>            Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that conforms to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect, but I guess solution 3 is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation, where engines 
> writing such a page cannot write a conforming parquet file and will just pick 
> one of the solutions arbitrarily.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  


