RussellSpitzer commented on a change in pull request #1638:
URL: https://github.com/apache/iceberg/pull/1638#discussion_r509774287
##########
File path:
parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java
##########
@@ -423,4 +451,24 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
return (T) conversions.get(id).apply(statistics.genericGetMax());
}
}
+
+ /**
+ * Checks against older versions of Parquet statistics which may have a null count but undefined min and max
+ * statistics. Returns true if nonNull values exist in the row group but no further statistics are available.
+ * <p>
+ * We can't use {@code statistics.hasNonNullValue()} because it is inaccurate with older files and will return
+ * false if min and max are not set.
+ * <p>
+ * This is specifically for 1.5.0-CDH Parquet builds and later which contain the different unusual hasNonNull
+ * behavior. OSS Parquet builds are not effected because PARQUET-251 prohibits the reading of these statistics
Review comment:
Other types report min/max properly, and hasNonNulls works for them. I'd have
to dig into the fix CDH made to know for sure why their behavior is this way,
but I think they just disabled min/max stats to get around the PARQUET-251
bug.
Strings are stored as a binary type.
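
To make the case concrete, here is a rough sketch of the condition described in
the Javadoc above. It is not the code from this PR: `ColumnStatsSketch` and its
fields are hypothetical stand-ins for the Parquet statistics object, and I'm
assuming the row group's value count is available separately from the column
metadata.

```java
// Hypothetical stand-in for the parts of a Parquet column chunk's statistics
// that matter here; the real Parquet Statistics class is not used.
final class ColumnStatsSketch {
    final long nullCount;     // number of null values recorded in the stats
    final boolean hasMinMax;  // false when min/max were never written

    ColumnStatsSketch(long nullCount, boolean hasMinMax) {
        this.nullCount = nullCount;
        this.hasMinMax = hasMinMax;
    }

    // True when some values must be non-null (fewer nulls than values) but
    // min/max are missing, so the row group cannot be ruled out by bounds.
    static boolean hasNonNullButNoMinMax(ColumnStatsSketch stats, long valueCount) {
        return stats.nullCount < valueCount && !stats.hasMinMax;
    }
}
```

With this shape, a filter would keep the row group whenever the check returns
true, instead of trusting a "no non-null values" answer derived from the
missing min/max.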