RussellSpitzer commented on code in PR #6517:
URL: https://github.com/apache/iceberg/pull/6517#discussion_r1064958762
##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java:
##########
@@ -560,24 +560,26 @@ private <T> T max(Statistics<?> statistics, int id) {
}
/**
- * Checks against older versions of Parquet statistics which may have a null
count but undefined
- * min and max statistics. Returns true if nonNull values exist in the row
group but no further
- * statistics are available.
- *
- * <p>We can't use {@code statistics.hasNonNullValue()} because it is
inaccurate with older files
- * and will return false if min and max are not set.
+ * Older versions of Parquet statistics which may have a null count but
undefined min and max
+ * statistics. This is similar to the current behavior when NaN values are
present.
*
* <p>This is specifically for 1.5.0-CDH Parquet builds and later which
contain the different
* unusual hasNonNull behavior. OSS Parquet builds are not effected because
PARQUET-251 prohibits
* the reading of these statistics from versions of Parquet earlier than
1.8.0.
*
* @param statistics Statistics to check
- * @param valueCount Number of values in the row group
- * @return true if nonNull values exist and no other stats can be used
+ * @return true if min and max statistics are null
*/
- static boolean hasNonNullButNoMinMax(Statistics statistics, long valueCount)
{
- return statistics.getNumNulls() < valueCount
- && (statistics.getMaxBytes() == null || statistics.getMinBytes() ==
null);
+ static boolean nullMinMax(Statistics statistics) {
+ return statistics.getMaxBytes() == null || statistics.getMinBytes() ==
null;
+ }
+
+ static boolean minMaxUndefined(Statistics statistics) {
+ return (!statistics.isEmpty() && !statistics.hasNonNullValue()) ||
nullMinMax(statistics);
Review Comment:
I'm not sure what you mean by
```are we including cases where hasNonNullValue is actually false because
the null count is zero?```
There are only four assignments of `hasNonNullValue`
1. In the Constructor for Statistics : false
2. & 3. In FloatStats and DoubleStats when a NaN is seen : false
4. In markAsNonEmpty() : true
```java
/**
* Sets the page/column as having a valid non-null value
* kind of misnomer here
*/
protected void markAsNotEmpty() {
hasNonNullValue = true;
}
```
This method is invoked in all of the Statistics sub classes when
"setMinMax", "initializeStats" or "setMinMaxFromBytes" are called.
So i'm not sure how hasNonNullValue could be actually false unless we see a
NaN or no min and max are set
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]