Hi all,

I’d like to start a discussion about how StrictMetricsEvaluator handles
missing columns, particularly in the presence of schema evolution.

I have an open PR [1] that addresses an existing TODO in StrictMetricsEvaluator
by using the maxFieldId of the schema that was used to write a data file to
reason about missing columns.

The current proposal in the PR is:

  *   Introduce a maxFieldId on DataFile (currently as a proof of concept).
  *   During metrics evaluation, if a column’s field id is greater than the
      file’s maxFieldId, it implies the column did not exist when the file
      was written. For such columns:
     *   isNull and isNotNaN return ROWS_MUST_MATCH.
     *   Other operations conservatively return ROWS_MIGHT_NOT_MATCH.
  *   Unit tests are added to validate the expected behavior.
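
To make the rule above concrete, here is a minimal, standalone sketch of the
decision logic. It is not the code from the PR and does not use Iceberg
classes: MissingColumnSketch, Op, Result, and evaluateIfMissing are
hypothetical names chosen for illustration. Treating a null maxFieldId as
"unknown" (and falling back to the regular metrics path) is also an assumption
of this sketch, not something the PR defines.

```java
import java.util.Optional;

final class MissingColumnSketch {
  // Hypothetical stand-ins for the evaluator's two possible answers.
  enum Result { ROWS_MUST_MATCH, ROWS_MIGHT_NOT_MATCH }

  // A small, illustrative subset of predicate operations.
  enum Op { IS_NULL, NOT_NULL, IS_NAN, NOT_NAN, EQ, LT, GT }

  // A column is considered missing from a file when its field id is greater
  // than the max field id of the schema the file was written with.
  // Assumption: a null maxFieldId means "unknown", so nothing is inferred.
  static boolean isMissing(int fieldId, Integer fileMaxFieldId) {
    return fileMaxFieldId != null && fieldId > fileMaxFieldId;
  }

  // Returns a result only for missing columns. An empty Optional means the
  // caller should fall through to the regular metrics-based evaluation.
  static Optional<Result> evaluateIfMissing(Op op, int fieldId, Integer fileMaxFieldId) {
    if (!isMissing(fieldId, fileMaxFieldId)) {
      return Optional.empty();
    }
    switch (op) {
      case IS_NULL:
      case NOT_NAN:
        // isNull and isNotNaN: every row in the file must match.
        return Optional.of(Result.ROWS_MUST_MATCH);
      default:
        // All other operations answer conservatively.
        return Optional.of(Result.ROWS_MIGHT_NOT_MATCH);
    }
  }

  public static void main(String[] args) {
    // Field id 7 was added after a file whose schema had max field id 5.
    System.out.println(evaluateIfMissing(Op.IS_NULL, 7, 5));
    System.out.println(evaluateIfMissing(Op.EQ, 7, 5));
    // Field id 3 existed at write time, so no shortcut applies.
    System.out.println(evaluateIfMissing(Op.IS_NULL, 3, 5));
  }
}
```

The Optional return keeps the missing-column shortcut separate from the
existing bounds and null-count checks, so files without a known maxFieldId
would behave exactly as they do today.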

At the moment, this is intentionally limited in scope:

  *   maxFieldId is only wired through tests.
  *   There is no inference or propagation of this value during write,
      metadata loading, or manifest reading.

I’d like feedback on:

  *   Whether a file-level maxFieldId is the right abstraction for reasoning 
about missing columns after schema evolution.
  *   Where this information should ideally be derived and stored.
  *   Whether there are existing issues or prior discussions related to similar 
approaches.
  *   Whether this is a direction the community would want to see explored
      further.

PR for reference:
[1] https://github.com/apache/iceberg/pull/15252

Thanks,
Varun
