Hi all,
I’d like to start a discussion about how StrictMetricsEvaluator handles
missing columns, particularly in the presence of schema evolution.
I have an open PR [1] that addresses an existing TODO in StrictMetricsEvaluator
by using the maxFieldId of the schema that was used to write a data file to
reason about missing columns.
The current proposal in the PR is:
* Introduce a maxFieldId on DataFile (currently as a proof of concept).
* During metrics evaluation, if a column’s field id is greater than the file’s
  maxFieldId, the column did not exist when the file was written. For such
  columns:
    * isNull and isNotNaN return ROWS_MUST_MATCH.
    * Other operations conservatively return ROWS_MIGHT_NOT_MATCH.
* Unit tests are added to validate the expected behavior.
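
To make the rule above concrete, here is a minimal, self-contained sketch of
the decision. It is not the code in the PR: the class name, the Op enum, and
evalIfMissing are hypothetical, and the two result constants are modeled as
plain booleans. It also assumes a column absent from the file reads as all
nulls (no initial default value).

```java
import java.util.Optional;

/** Hypothetical sketch of the missing-column rule; not Iceberg's actual API. */
final class MissingColumnSketch {
  // Modeled as booleans for this sketch.
  static final boolean ROWS_MUST_MATCH = true;
  static final boolean ROWS_MIGHT_NOT_MATCH = false;

  /** Illustrative subset of predicate operations. */
  enum Op { IS_NULL, NOT_NULL, IS_NAN, NOT_NAN, EQ, LT, GT }

  /**
   * A column is known to be missing only when the file carries a maxFieldId
   * and the column's field id is greater than it. A null maxFieldId means
   * "unknown", so nothing can be concluded.
   */
  static boolean isKnownMissing(int fieldId, Integer maxFieldId) {
    return maxFieldId != null && fieldId > maxFieldId;
  }

  /**
   * Returns the strict-evaluation result for a column known to be missing,
   * or Optional.empty() when the column is not known to be missing and the
   * regular metrics-based evaluation should run instead.
   */
  static Optional<Boolean> evalIfMissing(Op op, int fieldId, Integer maxFieldId) {
    if (!isKnownMissing(fieldId, maxFieldId)) {
      return Optional.empty();
    }
    switch (op) {
      case IS_NULL:
      case NOT_NAN:
        // Assumption: every value of a missing column is null, and null is not NaN.
        return Optional.of(ROWS_MUST_MATCH);
      default:
        // Conservative answer for all other operations.
        return Optional.of(ROWS_MIGHT_NOT_MATCH);
    }
  }
}
```

For example, with a file whose maxFieldId is 5, evalIfMissing(Op.IS_NULL, 7, 5)
yields ROWS_MUST_MATCH, while a column with field id 3 yields an empty result
and falls through to the existing metrics checks.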
At the moment, this is intentionally limited in scope:
* maxFieldId is only wired through tests.
* There is no inference or propagation of this value during write,
  metadata loading, or manifest reading.
I’d like feedback on:
* Whether a file-level maxFieldId is the right abstraction for reasoning
about missing columns after schema evolution.
* Where this information should ideally be derived and stored.
* Whether there are existing issues or prior discussions related to similar
approaches.
* If this is a direction the community would want to see explored further.
PR for reference:
[1] https://github.com/apache/iceberg/pull/15252
Thanks,
Varun