rdblue opened a new pull request, #9324: URL: https://github.com/apache/iceberg/pull/9324
This adds a `SystemConfig` setting to enable or disable unsafe Parquet ID fallback.

Unsafe Parquet ID fallback (`ParquetSchemaUtil.pruneColumnsFallback`) is Netflix-specific behavior that assigns column IDs by position if a file is missing IDs and there is no name mapping. This behavior predates the public release (a5eb3f6ba171ecfc517a4f09ae9654e7d8ae0291). It is only applicable to Netflix datasets that resolved columns by position in Parquet and could not use name mapping.

This PR is the first step toward removing fallback ID assignment: it adds an environment config that controls whether fallback is enabled.

We need to remove fallback ID assignment because it is not safe. For data from Spark or other systems with name-based evolution, fallback ID assignment will produce incorrect results when data file schemas have changed (for example, after a column was dropped). Also, integrations have recently relied on this behavior accidentally by producing incorrect Parquet files with no field IDs. Removing it will help Iceberg producers adhere to the spec.

Currently, fallback assignment is triggered when the `NameMapping` is null and a file has no field IDs. This PR adds the ability to disable fallback assignment by passing an empty `NameMapping`, which results in the original file schema after mapping.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
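
The trigger condition and the failure mode described in the pull request can be sketched in a few lines of plain Java. This is an illustrative model only, not Iceberg code: the class `FallbackSketch`, the enum `Strategy`, and the methods `choose` and `assignByPosition` are hypothetical names, a name mapping is reduced to a `Map<String, Integer>`, and position-based IDs are assumed to start at 1.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the decision described above; none of these names are Iceberg APIs.
final class FallbackSketch {
  enum Strategy { USE_FILE_IDS, APPLY_NAME_MAPPING, POSITION_FALLBACK }

  // Fallback is chosen only when the file has no field IDs AND the name mapping is null.
  // An empty (non-null) mapping still counts as a mapping, so fallback is skipped.
  static Strategy choose(boolean fileHasIds, Map<String, Integer> nameMapping) {
    if (fileHasIds) {
      return Strategy.USE_FILE_IDS;
    }
    if (nameMapping != null) {
      return Strategy.APPLY_NAME_MAPPING;
    }
    return Strategy.POSITION_FALLBACK;
  }

  // Position-based assignment: the i-th file column receives ID i + 1 (assumed start at 1).
  static Map<Integer, String> assignByPosition(List<String> fileColumns) {
    Map<Integer, String> idToColumn = new LinkedHashMap<>();
    for (int i = 0; i < fileColumns.size(); i++) {
      idToColumn.put(i + 1, fileColumns.get(i));
    }
    return idToColumn;
  }

  public static void main(String[] args) {
    // Table columns with IDs: a=1, b=2 (dropped), c=3, d=4.
    // A file written later by a name-based writer contains only a, c, d and no field IDs.
    Map<Integer, String> byPosition = assignByPosition(List.of("a", "c", "d"));

    // The table's column c has ID 3, but position-based IDs point ID 3 at file column d.
    System.out.println("ID 3 resolves to file column: " + byPosition.get(3));
    System.out.println("null mapping:  " + choose(false, null));
    System.out.println("empty mapping: " + choose(false, Map.of()));
  }
}
```

In this model, dropping column `b` shifts every later column by one position, so a reader that assigns IDs by position silently reads `d` where the table expects `c`. Passing an empty mapping avoids that path because the null check no longer matches.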
