Fokko commented on code in PR #13398:
URL: https://github.com/apache/iceberg/pull/13398#discussion_r2282379020
##########
api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java:
##########
@@ -69,13 +71,26 @@ public StrictMetricsEvaluator(Schema schema, Expression unbound, boolean caseSen
    * otherwise.
    */
   public boolean eval(ContentFile<?> file) {
-    // TODO: detect the case where a column is missing from the file using file's max field id.
+    if (file.valueCounts() != null) {
+      int maxFieldId = file.valueCounts().keySet().stream().mapToInt(i -> i).max().orElse(0);
Review Comment:
@manirajv06 Since this is now a spec-change, it requires [discussion on the
dev-list](https://iceberg.apache.org/contribute/#merging-pull-requests):
> Changes to files under the format directory and open-api/rest-catalog* are
considered specification changes. Unless already covered under an Iceberg
improvement proposal, specification changes require their own vote (e.g. bug
fixes or specification clarifications). The vote follows the ASF [code
modification](https://www.apache.org/foundation/voting.html#votes-on-code-modification)
model and no lazy consensus modifier. Grammar, spelling and minor formatting
fixes are exempted from this rule. Draft specifications (new independent
specifications that are going through the Iceberg improvement process) do not
require a vote but authors should provide notice on the developer mailing list
about substantive changes (the final draft will be subject to a vote).
This is also a good way to get more eyes on the PR. I think this is the only way to avoid fetching the footer when you filter on a column that isn't present in the file; however, it comes at a cost: we now also have to distribute the schemas to the (distributed) Parquet readers.
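To make the heuristic in the diff concrete, here is a standalone sketch (names are illustrative, not Iceberg's actual API) of the max-field-id check: if a column's field id exceeds the largest field id that has recorded value counts, the column cannot be present in the file. Note the test is one-sided, since field ids can be sparse, `fieldId <= maxFieldId` does not prove presence, it only fails to rule it out.

```java
import java.util.Map;

// Hypothetical sketch of the max-field-id heuristic discussed in the PR.
public class MaxFieldIdSketch {
  static boolean mayContainColumn(Map<Integer, Long> valueCounts, int fieldId) {
    if (valueCounts == null) {
      return true; // no stats recorded: we cannot rule the column out
    }
    // Largest field id for which the writer recorded value counts.
    int maxFieldId = valueCounts.keySet().stream().mapToInt(i -> i).max().orElse(0);
    // Ids beyond the max were added to the schema after this file was written.
    return fieldId <= maxFieldId;
  }

  public static void main(String[] args) {
    Map<Integer, Long> counts = Map.of(1, 100L, 2, 100L, 3, 100L);
    System.out.println(mayContainColumn(counts, 3)); // true
    System.out.println(mayContainColumn(counts, 4)); // false
  }
}
```

This is why the check can short-circuit without reading the Parquet footer: the value counts already live in the manifest entry.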
> Follow-up question: Can we guarantee that all schema columns are present in the data file? Say schema ID 1 is linked to data files 1 to 100, and schema ID 2 is linked to data files 101 to 200. Schema ID 1 has three columns: a, b, c. Schema ID 2 has two columns: d, e. Is it guaranteed that all data files 1 to 100 have all three columns a, b, c? Or could there be a situation where data files 1 to 50 have only two columns a, b (c being an optional column) while files 51 to 100 have all three?
Typically you write all the columns in the schema to the file, but when a column is missing, I think Iceberg will just project the `initial-default` value in its place.
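As an illustration of that `initial-default` behavior (a simplified sketch, not Iceberg's reader API), a reader can fill in a column that is absent from an older data file by projecting the schema's declared default instead of failing:

```java
import java.util.Map;

// Hypothetical sketch: project a default for a column absent from the file,
// mirroring the spec's initial-default semantics for columns added after
// the file was written. Names and types are illustrative only.
public class InitialDefaultSketch {
  static Object readColumn(Map<String, Object> fileRow, String column, Object initialDefault) {
    // Column present in the file: return the stored value.
    // Column missing (file predates the column): project the initial default.
    return fileRow.containsKey(column) ? fileRow.get(column) : initialDefault;
  }

  public static void main(String[] args) {
    Map<String, Object> oldFileRow = Map.of("a", 1, "b", 2); // file lacks column "c"
    System.out.println(readColumn(oldFileRow, "c", 0)); // prints 0
  }
}
```

So files 1 to 50 missing optional column c would still read consistently; the reader supplies the default rather than requiring the column on disk.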
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]