RussellSpitzer commented on code in PR #6517:
URL: https://github.com/apache/iceberg/pull/6517#discussion_r1063527704


##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java:
##########
@@ -560,24 +560,26 @@ private <T> T max(Statistics<?> statistics, int id) {
   }
 
   /**
-   * Checks against older versions of Parquet statistics which may have a null 
count but undefined
-   * min and max statistics. Returns true if nonNull values exist in the row 
group but no further
-   * statistics are available.
-   *
-   * <p>We can't use {@code statistics.hasNonNullValue()} because it is 
inaccurate with older files
-   * and will return false if min and max are not set.
+   * Older versions of Parquet statistics which may have a null count but 
undefined min and max
+   * statistics. This is similar to the current behavior when NaN values are 
present.
    *
    * <p>This is specifically for 1.5.0-CDH Parquet builds and later which 
contain the different
    * unusual hasNonNull behavior. OSS Parquet builds are not effected because 
PARQUET-251 prohibits
    * the reading of these statistics from versions of Parquet earlier than 
1.8.0.
    *
    * @param statistics Statistics to check
-   * @param valueCount Number of values in the row group
-   * @return true if nonNull values exist and no other stats can be used
+   * @return true if min and max statistics are null
    */
-  static boolean hasNonNullButNoMinMax(Statistics statistics, long valueCount) 
{
-    return statistics.getNumNulls() < valueCount
-        && (statistics.getMaxBytes() == null || statistics.getMinBytes() == 
null);
+  static boolean nullMinMax(Statistics statistics) {
+    return statistics.getMaxBytes() == null || statistics.getMinBytes() == 
null;
+  }
+
+  static boolean minMaxUndefined(Statistics statistics) {
+    return (!statistics.isEmpty() && !statistics.hasNonNullValue()) || 
nullMinMax(statistics);

Review Comment:
   This is actually a direct reference to the logic in Parquet Statistics.java 
. I'm doing the same check they do which is "!IsEmpty && !nonNullValue" see
   
   
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java#L454



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to