I have put together a POC implementation to verify the improvement.
PR: https://github.com/apache/iceberg/pull/15252
Core improvement / justification:
Before schema evolution: 6 fields in the schema (file 1 is written with this schema)
After schema evolution: 18 fields in the schema
When querying field 10 on file 1 with isNull or notNaN:

  * Existing behavior in StrictMetricsEvaluator: ROWS_MIGHT_NOT_MATCH
  * With maxFieldId: returns ROWS_MUST_MATCH

For other operations on field 10:

  * Existing behavior: ROWS_MIGHT_NOT_MATCH
  * With maxFieldId: same result, but with early exit

Similar behavior applies to InclusiveMetricsEvaluator.
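
To make the two cases above concrete, here is a minimal, self-contained sketch of the strict-evaluator logic. It is not the code from the PR and does not use Iceberg's classes: MaxFieldIdSketch, strictIsNull, strictValuePredicate and the statsBasedCheck parameter are hypothetical names for illustration. It assumes file 1 uses field IDs 1 to 6 (so its maxFieldId is 6), that newer fields always get higher IDs, and that a column missing from a file reads as null.

```java
import java.util.Collections;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: not the PR's code and not Iceberg's evaluator classes.
final class MaxFieldIdSketch {
  enum Result { ROWS_MUST_MATCH, ROWS_MIGHT_NOT_MATCH }

  // True when the file records a maxFieldId and the queried field is newer than it.
  // Assumption: field IDs only grow, so such a field did not exist at write time.
  static boolean addedAfterWrite(int fieldId, Integer maxFieldId) {
    return maxFieldId != null && fieldId > maxFieldId;
  }

  // Strict isNull. notNaN is analogous, because a null value is not NaN.
  static Result strictIsNull(int fieldId, Integer maxFieldId,
      Map<Integer, Long> valueCounts, Map<Integer, Long> nullCounts) {
    if (addedAfterWrite(fieldId, maxFieldId)) {
      // Assumed: every row reads as null for a column the file never had.
      return Result.ROWS_MUST_MATCH;
    }
    // Without maxFieldId, only column stats can prove that all values are null.
    Long values = valueCounts.get(fieldId);
    Long nulls = nullCounts.get(fieldId);
    if (values != null && values.equals(nulls)) {
      return Result.ROWS_MUST_MATCH;
    }
    return Result.ROWS_MIGHT_NOT_MATCH;
  }

  // Any value-based predicate (eq, lt, ...). statsBasedCheck stands in for the
  // existing bounds/count logic and is only consulted when no early exit applies.
  static Result strictValuePredicate(int fieldId, Integer maxFieldId,
      Supplier<Result> statsBasedCheck) {
    if (addedAfterWrite(fieldId, maxFieldId)) {
      return Result.ROWS_MIGHT_NOT_MATCH; // early exit: all values are null
    }
    return statsBasedCheck.get();
  }

  public static void main(String[] args) {
    Map<Integer, Long> noStats = Collections.emptyMap(); // file 1 has no stats for field 10
    System.out.println(strictIsNull(10, null, noStats, noStats)); // maxFieldId not recorded
    System.out.println(strictIsNull(10, 6, noStats, noStats));    // maxFieldId recorded
    System.out.println(strictValuePredicate(10, 6, () -> Result.ROWS_MUST_MATCH));
  }
}
```

The stats lookup is kept as the fallback so that files without a recorded maxFieldId behave exactly as they do today.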

I understand this would require adding a new field to manifest files (an Iceberg 
specification change). I’d appreciate the community’s view on whether this 
improvement justifies that.
If maxFieldId can instead be derived from the schema used to write the file, 
without adding it to DataFile, I would be happy to explore that direction.
