rdblue commented on a change in pull request #110: Fix Iceberg Parquet Reader scanning when filtering on nested types
URL: https://github.com/apache/incubator-iceberg/pull/110#discussion_r259439668
 
 

 ##########
 File path: parquet/src/main/java/com/netflix/iceberg/parquet/ParquetMetricsRowGroupFilter.java
 ##########
 @@ -156,6 +157,13 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
       Preconditions.checkNotNull(struct.field(id),
           "Cannot filter by nested column: %s", schema.findField(id));
 
+      // When filtering nested types notNull() is implicit filter passed even
 
 Review comment:
   So Spark will take a filter like `a.b = 12` and attempt to push down 
`notNull(a)`. Is that correct?
   
   Because Parquet only keeps stats for primitive columns, the value count for 
a nested column is null, and the filter assumes that the column is missing when 
it is actually just not a leaf of the schema.
   
   I agree with this fix. Are there other cases that also need to be 
supported? It seems like this should be checked in every visitor method 
(`isNull`, `equal`, `notEqual`, etc.) for correctness.
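   
   A minimal sketch of what checking this in each visitor method could look 
like. Everything here is hypothetical and only for illustration: the class 
name, the `nestedIds` set, and the simplified `notNull`/`isNull` signatures are 
assumptions, not the actual `ParquetMetricsRowGroupFilter` code. The idea is 
that a field ID without stats is only treated as a missing column when it is 
not a nested type.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and structure are illustrative, not the real filter class.
class NestedColumnStatsSketch {
  static final boolean ROWS_MIGHT_MATCH = true;
  static final boolean ROWS_CANNOT_MATCH = false;

  // Row-group stats keyed by field ID; only primitive (leaf) columns have entries.
  private final Map<Integer, Long> valueCounts;
  private final Map<Integer, Long> nullCounts;
  // Field IDs of nested types (assumed to be collected from the schema).
  private final Set<Integer> nestedIds;

  NestedColumnStatsSketch(Map<Integer, Long> valueCounts,
                          Map<Integer, Long> nullCounts,
                          Set<Integer> nestedIds) {
    this.valueCounts = valueCounts;
    this.nullCounts = nullCounts;
    this.nestedIds = nestedIds;
  }

  boolean notNull(int id) {
    if (nestedIds.contains(id)) {
      return ROWS_MIGHT_MATCH; // no stats for a non-leaf field, so it cannot be pruned
    }
    Long valueCount = valueCounts.get(id);
    if (valueCount == null) {
      return ROWS_CANNOT_MATCH; // leaf column missing from the file: all values are null
    }
    Long nulls = nullCounts.get(id);
    if (nulls != null && nulls.equals(valueCount)) {
      return ROWS_CANNOT_MATCH; // every value in the row group is null
    }
    return ROWS_MIGHT_MATCH;
  }

  boolean isNull(int id) {
    if (nestedIds.contains(id)) {
      return ROWS_MIGHT_MATCH; // same guard as notNull
    }
    Long valueCount = valueCounts.get(id);
    if (valueCount == null) {
      return ROWS_MIGHT_MATCH; // missing leaf column reads as all nulls
    }
    Long nulls = nullCounts.get(id);
    if (nulls != null && nulls == 0) {
      return ROWS_CANNOT_MATCH; // stats show no nulls
    }
    return ROWS_MIGHT_MATCH;
  }

  // Example schema: a struct<b: int>, with a = 1 and a.b = 2 (IDs are made up).
  static NestedColumnStatsSketch example() {
    Map<Integer, Long> valueCounts = new HashMap<>();
    valueCounts.put(2, 100L);
    Map<Integer, Long> nullCounts = new HashMap<>();
    nullCounts.put(2, 0L);
    Set<Integer> nestedIds = new HashSet<>();
    nestedIds.add(1);
    return new NestedColumnStatsSketch(valueCounts, nullCounts, nestedIds);
  }

  public static void main(String[] args) {
    NestedColumnStatsSketch filter = example();
    System.out.println("notNull(a)       -> " + filter.notNull(1));
    System.out.println("notNull(a.b)     -> " + filter.notNull(2));
    System.out.println("notNull(missing) -> " + filter.notNull(3));
  }
}
```

   The same guard would sit at the top of `equal`, `notEqual`, and the other 
methods, so it may be simpler to factor it into one shared helper.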
   
   An alternative is to also check the predicate on all child types, but I 
think this simple solution is fine for now. Maybe the code comment should note 
that we could do more work to check?
