RussellSpitzer commented on a change in pull request #1638:
URL: https://github.com/apache/iceberg/pull/1638#discussion_r509774287



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java
##########
@@ -423,4 +451,24 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
       return (T) conversions.get(id).apply(statistics.genericGetMax());
     }
   }
+
+  /**
+   * Checks against older versions of Parquet statistics which may have a null count but undefined min and max
+   * statistics. Returns true if nonNull values exist in the row group but no further statistics are available.
+   * <p>
+   * We can't use {@code  statistics.hasNonNullValue()} because it is inaccurate with older files and will return
+   * false if min and max are not set.
+   * <p>
+   * This is specifically for 1.5.0-CDH Parquet builds and later which contain the different unusual hasNonNull
+   * behavior. OSS Parquet builds are not effected because PARQUET-251 prohibits the reading of these statistics

Review comment:
   Other types report min/max properly, and `hasNonNullValue()` works for
   them. I'd have to dig into the fix CDH made to know for sure why their
   builds behave this way, but I think they just disabled min/max stats to
   get around the PARQUET-251 bug.

   Strings are stored as a binary type.
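
   To make the check described in the Javadoc concrete, here is a minimal,
   self-contained sketch. It is not the code from this PR: `ColumnStats` is a
   hypothetical stand-in for Parquet's statistics object, and the method name
   and signature are assumptions chosen for illustration. It only shows the
   idea of "fewer nulls than values, but min or max is missing".

```java
public class StatsCheckSketch {

  // Hypothetical stand-in for one column chunk's statistics; not a Parquet class.
  static final class ColumnStats {
    final long numNulls;
    final byte[] minBytes; // null when the writer did not record a min
    final byte[] maxBytes; // null when the writer did not record a max

    ColumnStats(long numNulls, byte[] minBytes, byte[] maxBytes) {
      this.numNulls = numNulls;
      this.minBytes = minBytes;
      this.maxBytes = maxBytes;
    }
  }

  // Assumed helper: true when the row group holds at least one non-null value
  // but the min/max bounds needed to evaluate a predicate are unavailable.
  static boolean hasNonNullButNoMinMax(ColumnStats stats, long valueCount) {
    boolean someNonNull = stats.numNulls < valueCount;
    boolean boundsMissing = stats.minBytes == null || stats.maxBytes == null;
    return someNonNull && boundsMissing;
  }

  public static void main(String[] args) {
    // Legacy-style stats: null count present, min/max absent.
    ColumnStats legacy = new ColumnStats(3L, null, null);
    // Every value is null, so there is nothing non-null to match.
    ColumnStats allNull = new ColumnStats(10L, null, null);
    // Complete stats: bounds are present and can be compared normally.
    ColumnStats complete = new ColumnStats(3L, new byte[] {'a'}, new byte[] {'z'});

    System.out.println(hasNonNullButNoMinMax(legacy, 10L));   // prints "true"
    System.out.println(hasNonNullButNoMinMax(allNull, 10L));  // prints "false"
    System.out.println(hasNonNullButNoMinMax(complete, 10L)); // prints "false"
  }
}
```

   When a check like this returns true, the filter presumably cannot rule the
   row group out from its statistics and has to treat it as a possible match.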






