gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243814135
##########
src/main/thrift/parquet.thrift:
##########
@@ -966,6 +985,23 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+
+ /**
+ * A list of Boolean values to determine pages that contain only NaNs. Only
+ * present for columns of type FLOAT and DOUBLE. If true, all non-null
+ * values in a page are NaN. Writers are suggested to set the corresponding
+ * entries in min_values and max_values to NaN, so that all lists have the same
+ * length and contain valid values. If false, then either all values in the
+ * page are null or there is at least one non-null non-NaN value in the page.
+ * As readers are supposed to ignore all NaN values in bounds, legacy readers
+ * who do not consider nan_pages yet are still able to use the column index
+ * but are not able to skip only-NaN pages.
+ */
+ 6: optional list<bool> nan_pages
Review Comment:
@JFinis, your idea sounds good, but unfortunately it is not that easy. Since
no total ordering is specified, NaN values may sort before negative infinity or
after positive infinity. An implementation that currently writes NaN values to
column indexes will break in this scenario.
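To illustrate the ordering point, here is a minimal Python sketch (not
parquet-mr code). It assumes a writer that sorts doubles by a bit-pattern key
in the style of the IEEE 754 totalOrder predicate; the helper name
`total_order_key` and the sample values are hypothetical. Under such an order,
a NaN with the sign bit set lands before negative infinity and a NaN with the
sign bit clear lands after positive infinity.

```python
import math
import struct


def total_order_key(x: float) -> int:
    """Map a double to an integer whose natural order mimics IEEE 754 totalOrder.

    Hypothetical helper for illustration only; not part of any Parquet library.
    """
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    if bits & (1 << 63):
        # Sign bit set: flip every bit so larger magnitudes sort first.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set the sign bit so these sort after all negatives.
    return bits | (1 << 63)


# Two quiet NaNs that differ only in the sign bit.
NEG_NAN = struct.unpack(">d", bytes.fromhex("fff8000000000000"))[0]
POS_NAN = struct.unpack(">d", bytes.fromhex("7ff8000000000000"))[0]

values = [1.0, POS_NAN, -math.inf, 0.0, NEG_NAN, math.inf, -1.0]
ordered = sorted(values, key=total_order_key)
```

With this key, `ordered` starts with the negative NaN and ends with the
positive NaN, so a min/max pair taken from it can be NaN on either side.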
@pitrou, I brought up boundary order because it was our original answer to
these ordering problems. NaN values are not the only potential ordering
issue. E.g., how should we order internationalized UTF-8 strings?
I agree that the current parquet-mr handling of NaN values in column indexes
is not correct. But that also means we cannot make this change without
breaking older parquet-mr readers. Boundary order would solve this from the
parquet-mr point of view, but if other implementations do not use it, it is
not a good choice on its own either.
If there are Parquet files with column indexes containing NaN values and we
consider them valid, then we need to fix this issue in parquet-mr, and that is
unrelated to this format change. However, whether they are really valid is not
an easy question. Are both min and max NaN? If not, what total ordering does
the system that writes these files use? Can this format change be compatible
with that system?
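As a rough sketch of why NaN bounds matter to readers (Python, hypothetical
function names, not the parquet-mr implementation): a range check that compares
bounds directly drops a page whose min or max is NaN, because every comparison
with NaN is false. A conservative reader that treats NaN bounds as unusable
keeps the page instead.

```python
import math


def naive_may_contain(min_v: float, max_v: float, lo: float, hi: float) -> bool:
    """Range-overlap test that trusts the bounds as written (hypothetical)."""
    return min_v <= hi and max_v >= lo


def conservative_may_contain(min_v: float, max_v: float, lo: float, hi: float) -> bool:
    """Range-overlap test that ignores NaN bounds (hypothetical).

    If either bound is NaN, the bounds carry no information, so the page
    must be read rather than skipped.
    """
    if math.isnan(min_v) or math.isnan(max_v):
        return True
    return min_v <= hi and max_v >= lo


# A page holding [nan, 1.0, 5.0]: Python's built-in min/max return NaN here
# because NaN comes first and no later comparison against it succeeds.
page = [math.nan, 1.0, 5.0]
written_min, written_max = min(page), max(page)

# Predicate: 0.0 <= x <= 2.0, which the value 1.0 satisfies.
naive = naive_may_contain(written_min, written_max, 0.0, 2.0)
safe = conservative_may_contain(written_min, written_max, 0.0, 2.0)
```

Here `naive` is false, so the page would be skipped even though it contains a
matching value, while `safe` is true. Reordering the page to `[1.0, nan, 5.0]`
makes the built-in `min`/`max` return 1.0 and 5.0, so the written bounds depend
on value order when no total ordering is defined.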