wombatu-kun opened a new pull request, #16417:
URL: https://github.com/apache/iceberg/pull/16417

   ## Problem
   
   Fixes #11950 — https://github.com/apache/iceberg/issues/11950
   
   `MetricsConfig` keys per-column metrics modes by the **original** Iceberg 
column name, while `MetricsUtil.metricsMode(schema, config, fieldId)` resolves 
the field id to a name using whatever schema it is handed. When metrics are 
computed from a Parquet **footer** — the `add_files` / `migrate` / `snapshot` 
path via `ParquetUtil.fileMetrics` (`TableMigrationUtil`, the Delta→Iceberg 
snapshot action) — the schema is reconstructed from the file and carries 
**sanitized** names (`AvroSchemaUtil.makeCompatibleName`, e.g. `$event_time` → 
`_x24event_time`). The name lookup then misses and silently falls back to the 
default mode, which is `none` once a table exceeds 
`write.metadata.metrics.max-inferred-column-defaults` (default 100). Column 
bounds and value counts are silently dropped for escaped columns, degrading 
scan pruning. Explicit `write.metadata.metrics.column.<escaped-name>` overrides 
are ignored on the same path.
   
   The write path was already correct because `ParquetWriter.metrics()` passes 
the writer's original Iceberg schema; only the footer-reconstructed path was 
affected.
   
   ## Fix
   
   Alongside the original-name column modes, `MetricsConfig` now derives a 
separate alias map keyed by the **sanitized** form of each column name 
(sanitizing each dotted path component). `columnMode` falls back to this alias 
map, so a mode resolves whether the caller passes the original name (write 
path) or the sanitized name (footer path). The alias map is intentionally kept 
separate from `columnModes` so `validateReferencedColumns` keeps validating 
only user-supplied original names against the table schema. The change is 
confined to the private constructor, so every factory path benefits and there 
is no public API change (revapi clean).
   
   ## Tests
   
   - `TestMetricsModes`: inferred-default and explicit-override cases for an 
escaped column, asserting `columnMode` resolves by both the original and the 
sanitized name.
   - `TestParquet`: end-to-end regression that writes a Parquet file with an 
escaped column and verifies `ParquetUtil.fileMetrics` collects value counts and 
bounds for it.
   
   Closes #11950
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to