Re: [PR] Detect the case to identify missing column from the file using file's max field id in StrictMetricsEvaluator #13397 [iceberg]

via GitHub Sun, 06 Jul 2025 10:07:26 -0700


manirajv06 commented on code in PR #13398:
URL: https://github.com/apache/iceberg/pull/13398#discussion_r2188474388



##########
api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java:
##########
@@ -69,13 +71,26 @@ public StrictMetricsEvaluator(Schema schema, Expression unbound, boolean caseSen
    *     otherwise.
    */
   public boolean eval(ContentFile<?> file) {
-    // TODO: detect the case where a column is missing from the file using file's max field id.
+    if (file.valueCounts() != null) {
+      int maxFieldId = file.valueCounts().keySet().stream().mapToInt(i -> i).max().orElse(0);

Review Comment:
   @Fokko I made changes to link the schema ID to the data file only. This is a
WIP PR. Since the linking change touches many places, I committed the changes
to get your feedback on the overall direction.
   
   I still need to focus on the following:
   
   1. Test coverage
   2. Delete files
   
   Follow-up question: can we guarantee that all schema columns are present in
the data file? Suppose schema ID 1 is linked to data files 1 to 100 and schema
ID 2 is linked to data files 101 to 200. Schema ID 1 has three columns: a, b, c.
Schema ID 2 has two columns: d, e. Is it guaranteed that all of data files 1 to
100 have all three columns a, b, c? Or could there be a situation where data
files 1 to 50 have only the two columns a and b, because c is an optional
column, while data files 51 to 100 have all three columns?
   I am assuming all three columns would be present, with a "null" default value
for the optional column c. Please confirm.
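
   To make the question concrete, here is a minimal sketch of the max-field-id
check from the diff, applied to the scenario above. It is not Iceberg code:
`MissingColumnSketch` and both of its methods are hypothetical, and a plain
`Map<Integer, Long>` stands in for the result of `file.valueCounts()`. It also
assumes the map has an entry for every column physically written to the file,
which is exactly the point that needs confirming.

```java
import java.util.Map;

// Hypothetical helper, not part of Iceberg. The map stands in for
// ContentFile.valueCounts(): field id -> number of values in the file.
final class MissingColumnSketch {

  // Largest field id with a recorded value count, or 0 if the map is null or empty.
  static int maxFieldId(Map<Integer, Long> valueCounts) {
    if (valueCounts == null) {
      return 0;
    }
    return valueCounts.keySet().stream().mapToInt(Integer::intValue).max().orElse(0);
  }

  // Assumption: a field id above the file's max id was added after the file was
  // written. A column with a lower id that is absent is NOT detected this way.
  static boolean addedAfterFileWasWritten(Map<Integer, Long> valueCounts, int fieldId) {
    return valueCounts != null && fieldId > maxFieldId(valueCounts);
  }

  public static void main(String[] args) {
    // Data files 1 to 50 in the question: only columns a (id 1) and b (id 2).
    Map<Integer, Long> earlyFile = Map.of(1, 100L, 2, 100L);
    // Data files 51 to 100: columns a, b, and c (id 3).
    Map<Integer, Long> lateFile = Map.of(1, 100L, 2, 100L, 3, 100L);

    System.out.println(addedAfterFileWasWritten(earlyFile, 3)); // true
    System.out.println(addedAfterFileWasWritten(lateFile, 3));  // false
  }
}
```

   If files 1 to 50 can really be written without column c, this check would
flag c as missing for them, but only because c has the highest field id in
this example.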



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

