rdblue opened a new pull request, #9324:
URL: https://github.com/apache/iceberg/pull/9324

   This adds a `SystemConfig` setting to enable or disable unsafe Parquet ID 
fallback.
   
   Unsafe Parquet ID fallback (`ParquetSchemaUtil.pruneColumnsFallback`) is 
Netflix-specific behavior that assigns column IDs by position if a file is 
missing IDs and there is no name mapping. The behavior pre-dates the public 
release (a5eb3f6ba171ecfc517a4f09ae9654e7d8ae0291). It applies only to Netflix 
datasets that resolved columns by position in Parquet and could not use name 
mapping.
   
   This PR is the first step toward removing fallback ID assignment: it adds an 
environment config that controls whether fallback is enabled. We need to remove 
fallback ID assignment because it is not safe. For data from Spark or other 
systems with name-based evolution, fallback ID assignment will produce incorrect 
results when data file schemas have changed (for example, when a column was 
dropped). Also, some integrations have recently relied on this behavior 
accidentally by producing incorrect Parquet files with no field IDs. Removing 
it will help Iceberg producers adhere to the spec.
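
   To illustrate why positional assignment is unsafe, here is a minimal, 
self-contained sketch. It does not use Iceberg or Parquet classes: the class 
`PositionalFallbackSketch`, the method `assignIdsByPosition`, and the column 
names are hypothetical, and the 1-based numbering is an assumption made for 
the example. It shows a file written without the `name` column being read 
against a table that still expects `name` at ID 2.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Iceberg's implementation.
class PositionalFallbackSketch {

  // Assigns IDs by position in the file schema (assumed 1-based here).
  static Map<Integer, String> assignIdsByPosition(List<String> fileColumns) {
    Map<Integer, String> idToFileColumn = new LinkedHashMap<>();
    for (int pos = 0; pos < fileColumns.size(); pos++) {
      idToFileColumn.put(pos + 1, fileColumns.get(pos));
    }
    return idToFileColumn;
  }

  public static void main(String[] args) {
    // Table schema: field ID -> column name.
    Map<Integer, String> tableSchema = new LinkedHashMap<>();
    tableSchema.put(1, "id");
    tableSchema.put(2, "name");
    tableSchema.put(3, "email");

    // A data file written by name, without "name" and without field IDs.
    List<String> fileColumns = List.of("id", "email");
    Map<Integer, String> resolved = assignIdsByPosition(fileColumns);

    for (Map.Entry<Integer, String> field : tableSchema.entrySet()) {
      System.out.println(
          String.format(
              "table column %s (ID %d) reads file column %s",
              field.getValue(), field.getKey(), resolved.get(field.getKey())));
    }
    // Prints:
    // table column id (ID 1) reads file column id
    // table column name (ID 2) reads file column email
    // table column email (ID 3) reads file column null
  }
}
```

   In this sketch the table's `name` column silently reads `email` data, and 
the table's `email` column finds no match, which is the kind of incorrect 
result described above.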
   
   Currently, fallback assignment behavior is triggered when a `NameMapping` is 
null and a file has no field IDs. This PR adds the ability to disable fallback 
assignment by passing an empty `NameMapping`: applying an empty mapping leaves 
the original file schema unchanged, so no IDs are assigned by position.
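
   The trigger condition described above can be sketched as a small decision 
function. This is an illustration under stated assumptions, not the code in 
this PR: `IdResolutionSketch`, `Strategy`, and `choose` are hypothetical names, 
and a plain `Map` stands in for `NameMapping`.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical illustration only; a Map stands in for NameMapping.
class IdResolutionSketch {

  enum Strategy { USE_FILE_IDS, APPLY_NAME_MAPPING, POSITIONAL_FALLBACK }

  // Picks how column IDs are resolved for a file.
  static Strategy choose(boolean fileHasIds, Map<String, Integer> nameMapping) {
    if (fileHasIds) {
      return Strategy.USE_FILE_IDS;
    }
    if (nameMapping != null) {
      // An empty mapping also lands here, so fallback is never reached.
      return Strategy.APPLY_NAME_MAPPING;
    }
    // Only a null mapping on a file without IDs triggers fallback.
    return Strategy.POSITIONAL_FALLBACK;
  }

  public static void main(String[] args) {
    System.out.println(choose(false, null));                   // POSITIONAL_FALLBACK
    System.out.println(choose(false, Collections.emptyMap())); // APPLY_NAME_MAPPING
    System.out.println(choose(true, null));                    // USE_FILE_IDS
  }
}
```

   Passing an empty mapping rather than null is what keeps a file without IDs 
off the fallback path in this sketch.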


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

