I have raised a PR for this and would appreciate a review from anyone who has time.
PR: https://github.com/apache/iceberg/pull/15252
On 2026/02/08 16:00:26 Varun Lakhyani wrote:
> Hi all,
>
> I’d like to start a discussion around the handling of missing columns in
> StrictMetricsEvaluator, particularly in the presence of schema evolution.
> I have an open PR [1] that addresses an existing TODO in
> StrictMetricsEvaluator by using the maxFieldId of the schema used to write
> a data file to reason about missing columns.
>
> The current proposal in the PR is:
>
> * Introduce a maxFieldId on DataFile (currently as a proof of concept).
> * During metrics evaluation, if a column’s field id is greater than the
>   file’s maxFieldId, it implies the column did not exist when the file
>   was written. For such columns:
>   * isNull and isNotNaN return ROWS_MUST_MATCH.
>   * Other operations conservatively return ROWS_MIGHT_NOT_MATCH.
> * Unit tests are added to validate the expected behavior.
>
> At the moment, this is intentionally limited in scope:
>
> * maxFieldId is only wired through tests.
> * There is no inference or propagation of this value during write,
>   metadata loading, or manifest reading.
>
> I’d like feedback on:
>
> * Whether a file-level maxFieldId is the right abstraction for reasoning
>   about missing columns after schema evolution.
> * Where this information should ideally be derived and stored.
> * Whether there are existing issues or prior discussions related to
>   similar approaches.
> * Whether this is a direction the community would want to see explored
>   further.
>
> PR for reference:
> [1] https://github.com/apache/iceberg/pull/15252
>
> Thanks,
> Varun
>
> Varun Lakhyani
> Indian Institute of Technology Roorkee
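
For anyone skimming the thread, the rule proposed above can be sketched roughly as follows. This is a standalone Python illustration of the decision logic only, not the Java code in the PR. The names `Op`, `column_missing`, and `eval_missing_column` are hypothetical, and treating an unknown maxFieldId as "cannot tell" is an assumption on my part rather than something the proposal states.

```python
from enum import Enum
from typing import Optional

# Stand-ins for the two outcomes named in the proposal (assumed to be booleans here).
ROWS_MUST_MATCH = True
ROWS_MIGHT_NOT_MATCH = False


class Op(Enum):
    """Hypothetical subset of predicate operations, for illustration only."""
    IS_NULL = "isNull"
    NOT_NULL = "notNull"
    IS_NAN = "isNaN"
    NOT_NAN = "notNaN"
    EQ = "eq"
    LT = "lt"


def column_missing(field_id: int, file_max_field_id: Optional[int]) -> bool:
    """True if the column was added after the file was written.

    Assumption: when the file carries no maxFieldId, we cannot conclude
    anything, so the column is not treated as missing.
    """
    if file_max_field_id is None:
        return False
    return field_id > file_max_field_id


def eval_missing_column(
    op: Op, field_id: int, file_max_field_id: Optional[int]
) -> Optional[bool]:
    """Apply the proposed rule for a column absent from the file.

    Returns None when the rule does not apply, meaning the caller would
    fall back to the regular metrics-based evaluation.
    """
    if not column_missing(field_id, file_max_field_id):
        return None
    # Every value of a column that did not exist at write time reads as null,
    # so isNull holds for all rows, and a null is never NaN.
    if op in (Op.IS_NULL, Op.NOT_NAN):
        return ROWS_MUST_MATCH
    # All other operations take the conservative answer, as in the proposal.
    return ROWS_MIGHT_NOT_MATCH


if __name__ == "__main__":
    # A file written when the schema's highest field id was 5; column 7 came later.
    print(eval_missing_column(Op.IS_NULL, 7, 5))
    print(eval_missing_column(Op.EQ, 7, 5))
    print(eval_missing_column(Op.IS_NULL, 3, 5))
```

Returning a separate "does not apply" value keeps the missing-column shortcut apart from the existing statistics checks, so the shortcut only takes effect when the file's maxFieldId is actually known.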
