JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762

   The gist of all opened issues is the question how to encode pages/column 
chunks that contain only NaNs. 
   
   This is actually only an issue for the `ColumnIndex`. For statistics in the 
`ColumnMetaData` or the page, we can find only-Nan pages/columnChunks by 
computing `num_values - null_count - nan_count == 0`. The `ColumnIndex` doesn't 
have `num_values`, so we can't perform this computation.
   
   I see three alternatives to handle the problem in the `ColumnIndex`:
   * My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
   * Adding `num_values` to the ColumnIndex, to make it symmetric with 
Statistics in pages & `ColumnMetaData` and to enable the computation 
`num_values - null_count - nan_count == 0`
   * Adding a `nan_pages` bool list to the column index, which indicates 
whether a page contains only NaNs
   
   **I'm fine with either of these, so I would like us to reach a consensus for 
one of the alternatives here; then I can update my PR to reflect the decision. 
As this is my first contribution to parquet, I don't know the decision 
processes here. Do we vote? Is there a single or group of decision makers? 
Please let me know how to come to a conclusion here.**
   
   As a help for the decision: Here are again the PROs and CONs of the three 
alternativs:
   
   * My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
     * **PRO:** Fully backward compatible
     * **PRO:** Needs no further lists in the ColumnIndex
     * **CON:** people are uneasy with storing NaNs in bounds, due to many 
existing bit patterns and therefore a bit fuzzy semantics.
   * Adding `num_values` to the ColumnIndex, to make it symmetric with 
Statistics in pages & `ColumnMetaData` and to enable the computation 
`num_values - null_count - nan_count == 0`
     * **PRO:** No NaNs in bounds, no encoding/bit-pattern fuzzyness
     * **PRO:** Makes the `ColumnIndex` symmetric to other statistics (and to 
Apache Iceberg)
     * **PRO:** The `num_values` would also be viable for other purposes. It 
feels weirdly asymmetric to not have this field in the column index. For 
example, this would help to gauge the number of nested values in a nested 
column.
     * **CON:** The extra `num_values` list would be in each column index, even 
for non FLOAT/DOUBLE columns, thereby adding space consumption and 
encoding/decoding overhead.
     * **CON:** Would make `null_pages` redundant, as `null_pages[i] == 
(num_values[i] - null_count[i] == 0)`
     * **CON:** In theory not 100% backward compatible, but probably not 
relevant in practice*
   * Adding a `nan_pages` bool list to the column index, which indicates 
whether a page contains only NaNs
     * **PRO:** No NaN encoding fuzzyness, no encoding/bit-pattern fuzzyness
     * **PRO:** Less space consumption than `num_values`. The list would only 
be present for FLOAT/DOUBLE columns
     * **PRO:** Along the lines of `null_pages` so following an existing 
pattern in the column index
     * **CON:** In theory not 100% backward compatible, but probably not 
relevant in practice*
     
   \* Explanation of "in theory not 100% backward compatible": Today, min and 
max in a column index have to have a valid value unless `null_pages` of the 
respective page is true. This would no longer hold if we decide to encode 
only-NaN pages through empty min/max + `nan_pages` or empty min/max + 
`num_values`. Thus, a legacy reader, who doesn't know the new lists, could come 
to the conclusion that the missing bounds constitute an invalid ColumnIndex and 
therefore might deem the whole Parquet file as invalid. I doubt that this is a 
problem in practice, as readers are written leniently. I.e., if a missing bound 
in a column index is encountered, the index might not be used (what would 
already happen today in case of an only-NaN page, so not a regression) or just 
that page might be treated as "has to be scanned". I don't know a reader that 
would reject the whole Parquet file in this case. Therefore, this is likely not 
relevant in practice.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to