szehon-ho commented on code in PR #6517:
URL: https://github.com/apache/iceberg/pull/6517#discussion_r1063893350
##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java:
##########
@@ -560,24 +560,26 @@ private <T> T max(Statistics<?> statistics, int id) {
}
/**
- * Checks against older versions of Parquet statistics which may have a null
count but undefined
- * min and max statistics. Returns true if nonNull values exist in the row
group but no further
- * statistics are available.
- *
- * <p>We can't use {@code statistics.hasNonNullValue()} because it is
inaccurate with older files
- * and will return false if min and max are not set.
+ * Older versions of Parquet statistics which may have a null count but
undefined min and max
+ * statistics. This is similar to the current behavior when NaN values are
present.
*
* <p>This is specifically for 1.5.0-CDH Parquet builds and later which
contain the different
* unusual hasNonNull behavior. OSS Parquet builds are not effected because
PARQUET-251 prohibits
* the reading of these statistics from versions of Parquet earlier than
1.8.0.
*
* @param statistics Statistics to check
- * @param valueCount Number of values in the row group
- * @return true if nonNull values exist and no other stats can be used
+ * @return true if min and max statistics are null
*/
- static boolean hasNonNullButNoMinMax(Statistics statistics, long valueCount)
{
- return statistics.getNumNulls() < valueCount
- && (statistics.getMaxBytes() == null || statistics.getMinBytes() ==
null);
+ static boolean nullMinMax(Statistics statistics) {
+ return statistics.getMaxBytes() == null || statistics.getMinBytes() ==
null;
+ }
+
+ static boolean minMaxUndefined(Statistics statistics) {
+ return (!statistics.isEmpty() && !statistics.hasNonNullValue()) ||
nullMinMax(statistics);
Review Comment:
Sounds good to me, I was just doing a sanity check by exploding the logic
(!statistics.hasNonNullValue() && !isNumNullsSet) ||
!statistics.hasNonNullValue() , and it checks out. I think I saw your comment
of toString() and was not sure if its that reliable as a source of truth.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]