I have raised a PR for this and would appreciate a review from anyone who has time.
PR: https://github.com/apache/iceberg/pull/15252


On 2026/02/08 16:00:26 Varun Lakhyani wrote:
> Hi all,
>
> I’d like to start a discussion around handling of missing columns in StrictMetricsEvaluator, particularly in the presence of schema evolution.
> I have an open PR [1] that addresses an existing TODO in StrictMetricsEvaluator by using maxFieldId of the schema used to write a data file to reason about missing columns.
> The current proposal in the PR is:
>
> * Introduce a maxFieldId on DataFile (currently as a proof of concept).
> * During metrics evaluation:
>   * If a column’s field id is greater than the file’s maxFieldId, it implies the column did not exist when the file was written.
>   * For such columns:
>     * isNull and isNotNaN return ROWS_MUST_MATCH.
>     * Other operations conservatively return ROWS_MIGHT_NOT_MATCH.
> * Unit tests are added to validate the expected behavior.
>
> At the moment, this is intentionally limited in scope:
>
> * maxFieldId is only wired through tests.
> * There is no inference or propagation of this value during write, metadata loading, or manifest reading.
>
> I’d like feedback on:
>
> * Whether a file-level maxFieldId is the right abstraction for reasoning about missing columns after schema evolution.
> * Where this information should ideally be derived and stored.
> * Whether there are existing issues or prior discussions related to similar approaches.
> * If this is a direction the community would want to see explored further.
>
> PR for reference:
> [1] https://github.com/apache/iceberg/pull/15252
>
> Thanks,
> Varun
>
>
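
For reviewers who want the rule from the quoted proposal in one place, here is a minimal, self-contained sketch of the decision logic. It is not the code in the PR and does not use Iceberg classes: MissingColumnSketch, Op, addedAfterWrite and evalMissingColumn are hypothetical names chosen for illustration. It assumes that field ids only grow as columns are added, and that a column absent from a file reads as all nulls.

```java
// Hypothetical sketch of the proposed rule; none of these names are Iceberg APIs.
final class MissingColumnSketch {
  // A strict evaluator answers "do ALL rows in the file match?".
  static final boolean ROWS_MUST_MATCH = true;
  static final boolean ROWS_MIGHT_NOT_MATCH = false;

  // Illustrative subset of predicate operations.
  enum Op { IS_NULL, NOT_NULL, IS_NAN, NOT_NAN, EQ, NOT_EQ, LT, GT }

  /**
   * Returns true when the column cannot have existed at write time,
   * assuming field ids are assigned in increasing order.
   */
  static boolean addedAfterWrite(int fieldId, int fileMaxFieldId) {
    return fieldId > fileMaxFieldId;
  }

  /**
   * Result for a predicate on a column known to be missing from the file.
   * Every value reads as null (assumption), so isNull and notNaN hold for
   * all rows; every other operation falls back to the conservative answer.
   */
  static boolean evalMissingColumn(Op op) {
    switch (op) {
      case IS_NULL:
      case NOT_NAN:
        return ROWS_MUST_MATCH;
      default:
        return ROWS_MIGHT_NOT_MATCH;
    }
  }

  public static void main(String[] args) {
    int fileMaxFieldId = 5; // made-up value for a file written before the column existed
    int newColumnId = 7;    // made-up id of a column added later
    if (addedAfterWrite(newColumnId, fileMaxFieldId)) {
      System.out.println("isNull -> " + evalMissingColumn(Op.IS_NULL));
      System.out.println("eq     -> " + evalMissingColumn(Op.EQ));
    }
  }
}
```

When the field id is not greater than the file's maxFieldId, the sketch says nothing; the evaluator would continue with its existing metrics-based checks.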




Lakhyani Varun
Indian Institute of Technology Roorkee
Contact: +91 96246 46174
