Fokko commented on code in PR #13398:
URL: https://github.com/apache/iceberg/pull/13398#discussion_r2282379020
##########
api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java:
##########
@@ -69,13 +71,26 @@ public StrictMetricsEvaluator(Schema schema, Expression unbound, boolean caseSen
    * otherwise.
    */
   public boolean eval(ContentFile<?> file) {
-    // TODO: detect the case where a column is missing from the file using file's max field id.
+    if (file.valueCounts() != null) {
+      int maxFieldId = file.valueCounts().keySet().stream().mapToInt(i -> i).max().orElse(0);
Review Comment:
@manirajv06 Since this is now a spec-change, it requires [discussion on the
dev-list](https://iceberg.apache.org/contribute/#merging-pull-requests):
> Changes to files under the format directory and open-api/rest-catalog* are
considered specification changes. Unless already covered under an Iceberg
improvement proposal, specification changes require their own vote (e.g. bug
fixes or specification clarifications). The vote follows the ASF [code
modification](https://www.apache.org/foundation/voting.html#votes-on-code-modification)
model and no lazy consensus modifier. Grammar, spelling and minor formatting
fixes are exempted from this rule. Draft specifications (new independent
specifications that are going through the Iceberg improvement process) do not
require a vote but authors should provide notice on the developer mailing list
about substantive changes (the final draft will be subject to a vote).
This is also a good way to get more eyes on the PR. I think this is the only way to avoid fetching the footer when you filter on a column that isn't present in the file; however, it comes at a cost: we now also have to distribute the schemas to the (distributed) Parquet readers.
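To make the heuristic in the diff concrete, here is a standalone sketch (names are illustrative, not Iceberg's actual API) of the max-field-id check: if a column's field id exceeds the largest field id that has recorded value counts, the column cannot be present in the file. Note the test is one-sided, since field ids can be sparse, `fieldId <= maxFieldId` does not prove presence, it only fails to rule it out.

```java
import java.util.Map;

// Hypothetical sketch of the max-field-id heuristic discussed in the PR.
public class MaxFieldIdSketch {
  static boolean mayContainColumn(Map<Integer, Long> valueCounts, int fieldId) {
    if (valueCounts == null) {
      return true; // no stats recorded: we cannot rule the column out
    }
    // Largest field id for which the writer recorded value counts.
    int maxFieldId = valueCounts.keySet().stream().mapToInt(i -> i).max().orElse(0);
    // Ids beyond the max were added to the schema after this file was written.
    return fieldId <= maxFieldId;
  }

  public static void main(String[] args) {
    Map<Integer, Long> counts = Map.of(1, 100L, 2, 100L, 3, 100L);
    System.out.println(mayContainColumn(counts, 3)); // true
    System.out.println(mayContainColumn(counts, 4)); // false
  }
}
```

This is why the check can short-circuit without reading the Parquet footer: the value counts already live in the manifest entry.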
> Follow-up question: Can we guarantee that all schema columns are present in the data file? Say schema ID 1 is linked to data files 1 to 100, and schema ID 2 is linked to data files 101 to 200. Schema ID 1 has three columns: a, b, c. Schema ID 2 has two columns: d, e. Is it guaranteed that all data files 1 to 100 have all three columns a, b, c? Or could there be a situation where data files 1 to 50 have only two columns a, b (c being an optional column) while files 51 to 100 have all three?
Typically you write all the columns in the schema to the file, but when a column is missing, I think Iceberg will just project the `initial-default` value in its place.
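As an illustration of that `initial-default` behavior (a simplified sketch, not Iceberg's reader API), a reader can fill in a column that is absent from an older data file by projecting the schema's declared default instead of failing:

```java
import java.util.Map;

// Hypothetical sketch: project a default for a column absent from the file,
// mirroring the spec's initial-default semantics for columns added after
// the file was written. Names and types are illustrative only.
public class InitialDefaultSketch {
  static Object readColumn(Map<String, Object> fileRow, String column, Object initialDefault) {
    // Column present in the file: return the stored value.
    // Column missing (file predates the column): project the initial default.
    return fileRow.containsKey(column) ? fileRow.get(column) : initialDefault;
  }

  public static void main(String[] args) {
    Map<String, Object> oldFileRow = Map.of("a", 1, "b", 2); // file lacks column "c"
    System.out.println(readColumn(oldFileRow, "c", 0)); // prints 0
  }
}
```

So files 1 to 50 missing optional column c would still read consistently; the reader supplies the default rather than requiring the column on disk.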
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]