Fokko commented on code in PR #13398:
URL: https://github.com/apache/iceberg/pull/13398#discussion_r2183634318


##########
api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java:
##########
@@ -69,13 +71,26 @@ public StrictMetricsEvaluator(Schema schema, Expression unbound, boolean caseSen
    *     otherwise.
    */
   public boolean eval(ContentFile<?> file) {
-    // TODO: detect the case where a column is missing from the file using file's max field id.
+    if (file.valueCounts() != null) {
+      int maxFieldId = file.valueCounts().keySet().stream().mapToInt(i -> i).max().orElse(0);

Review Comment:
   What I tried to say in my previous comment is that taking the max is not as 
robust as building a set of all the field IDs that are in the DataFile. When a 
table adds a lot of new columns, there might be gaps in the IDs of each file, 
and the max would not detect them. One example: when you replace the schema 
with something else (e.g. `CREATE OR REPLACE TABLE`), the new file will have a 
high max, but won't contain any of the fields from before the `REPLACE` 
operation.
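
   To make the difference concrete, here is a minimal sketch that compares the 
two checks on the keys of a value-counts map. `FieldIdPresence` and its methods 
are hypothetical names for illustration, not part of Iceberg, and the sketch 
assumes value counts were collected for every column written to the file.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Iceberg: compares two ways of deciding
// whether a data file could contain a given field ID, using only the keys
// of a value-counts map (field ID -> value count).
final class FieldIdPresence {
  private FieldIdPresence() {}

  // Max-based check: treats every ID up to the largest one seen as present.
  static boolean presentByMax(Map<Integer, Long> valueCounts, int fieldId) {
    int maxFieldId = valueCounts.keySet().stream().mapToInt(i -> i).max().orElse(0);
    return fieldId <= maxFieldId;
  }

  // Set-based check: only IDs that actually have an entry count as present.
  static boolean presentBySet(Map<Integer, Long> valueCounts, int fieldId) {
    Set<Integer> fieldIds = valueCounts.keySet();
    return fieldIds.contains(fieldId);
  }
}
```

   For a file written after a `REPLACE` with value counts 
`Map.of(4, 10L, 5, 10L, 6, 10L)`, a lookup for the old field ID 2 is reported 
as present by `presentByMax` but as missing by `presentBySet`.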
   
   I think it might be good to revive an old thread on the dev-list that 
suggested adding `schema-id` to the DataFile: 
https://lists.apache.org/thread/88md2fdk17k26cl4gj3sz6sdbtwcgbk5
   
   That would also solve another issue that's on my mind: we have the new 
`UnknownType` that isn't materialized in the files. This would mean that if you 
query on one of these columns, all the Parquet files would be opened (since 
there are no statistics for it). I don't think we have any logic for that today 
in the Java implementation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

