Fokko commented on code in PR #13398:
URL: https://github.com/apache/iceberg/pull/13398#discussion_r2183634318
##########
api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java:
##########
@@ -69,13 +71,26 @@ public StrictMetricsEvaluator(Schema schema, Expression unbound, boolean caseSen
* otherwise.
*/
public boolean eval(ContentFile<?> file) {
- // TODO: detect the case where a column is missing from the file using file's max field id.
+ if (file.valueCounts() != null) {
+ int maxFieldId = file.valueCounts().keySet().stream().mapToInt(i -> i).max().orElse(0);
Review Comment:
What I tried to say in my previous comment is that I think taking the
max is not as solid as just building a set of all the field IDs that are in the
DataFile. When a table adds a lot of new columns, there might be gaps in
the field IDs of each file, and the max would not tell you which of the
lower IDs are actually present. One example: when you replace the schema with
something else (e.g. `CREATE OR REPLACE TABLE`), the new file will have a high
max, but won't contain any of the fields from before the `REPLACE` operation.
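
To make the difference concrete, here is a minimal sketch, not the PR's code. It uses a plain `Map<Integer, Long>` as a stand-in for `file.valueCounts()`, and `FieldIdPresence`, `presentByMax` and `presentBySet` are hypothetical names made up for illustration. It also assumes that `valueCounts` has an entry for every column written to the file, which may not hold if metrics are disabled for a column.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of Iceberg.
final class FieldIdPresence {

  // Max-based check: treats every field ID up to the largest one seen
  // in the file as if it were present.
  static boolean presentByMax(Map<Integer, Long> valueCounts, int fieldId) {
    int maxFieldId = valueCounts.keySet().stream().mapToInt(i -> i).max().orElse(0);
    return fieldId <= maxFieldId;
  }

  // Set-based check: only field IDs that actually have an entry count as present.
  static boolean presentBySet(Map<Integer, Long> valueCounts, int fieldId) {
    Set<Integer> fieldIds = valueCounts.keySet();
    return fieldIds.contains(fieldId);
  }
}
```

For a file written after a `REPLACE` whose value counts only cover IDs 4, 5 and 6, `presentByMax(counts, 2)` returns `true` while `presentBySet(counts, 2)` returns `false`, which is the gap I mean.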

I think it might be good to revive an old thread on the dev list that
suggested adding `schema-id` to the DataFile:
https://lists.apache.org/thread/88md2fdk17k26cl4gj3sz6sdbtwcgbk5

That would also solve another issue that's on my mind: we have the new
`UnknownType`, which isn't materialized in the files. This means that if you
query on one of these columns, all the Parquet files would be opened (since
there are no statistics for it). I don't think we have any logic for that today
in the Java implementation.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]