RussellSpitzer commented on issue #4930: URL: https://github.com/apache/iceberg/issues/4930#issuecomment-1170378592
@karuppayya So the issue here is solely for files which do not have field IDs. These files are read using what we call a default name mapping, specified here: https://iceberg.apache.org/spec/#column-projection

> Tables may also define a property schema.name-mapping.default with a JSON name mapping containing a list of field mapping objects. These mappings provide fallback field ids to be used when a data file does not contain field id information. Each object should contain
>
> - names: A required list of 0 or more names for a field.
> - field-id: An optional Iceberg field ID used when a field’s name is present in names
> - fields: An optional list of field mappings for child field of structs, maps, and lists.

To walk through this whole scenario: imagine I have a table `A, B, C`, which Iceberg will internally note as fields (0, 1, 2).

I import a file (HiveFile) from Hive that contains `A, B, C`. This triggers (at least in some of our actions, like Migrate or Snapshot) the creation of a name mapping.

```
{ A -> 0, B -> 1, C -> 2 }
```

When reading HiveFile I see that I only have column names; no field IDs are present in the file metadata. So I use the mapping specified above to say which columns actually belong to which fields.

A file written by any Iceberg writer (IcebergFile) would contain, embedded in it, the mapping for that particular file.

```
Footer { iceberg.schema { A -> 0, B -> 1, C -> 2 } }
```

Now when we drop C and add C, our new table still has names `A, B, C` but fields `A -> 0, B -> 1, C -> 3`.

For IcebergFile we don't have a problem, because we look at that file and see it only has field "2", not "3", so it can't possibly have values for our new C (3).

The table's name mapping should have changed here as well. The C that we originally mapped as "C -> 2" was dropped, so we should no longer map that column from the Hive file. But I believe our current behavior is to change the name mapping so that it also contains "C -> 3".
This is incorrect.
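
To make the scenario concrete, here is a minimal Python sketch of the resolution logic described above. It is a toy model, not Iceberg's actual reader code: the function names `file_field_ids` and `project`, and the dictionaries standing in for schemas, files, and name mappings, are all assumptions made for illustration.

```python
from typing import Dict, Optional

def file_field_ids(file_columns: Dict[str, Optional[int]],
                   name_mapping: Dict[str, int]) -> Dict[int, str]:
    """Return {field ID: column name in the file}.

    Toy model (hypothetical): if the file embeds any field IDs, trust them;
    otherwise fall back to the table's name mapping.
    """
    has_ids = any(fid is not None for fid in file_columns.values())
    ids: Dict[int, str] = {}
    for name, fid in file_columns.items():
        if has_ids:
            if fid is not None:
                ids[fid] = name
        elif name in name_mapping:
            ids[name_mapping[name]] = name
    return ids

def project(table_schema: Dict[str, int],
            file_columns: Dict[str, Optional[int]],
            name_mapping: Dict[str, int]) -> Dict[str, Optional[str]]:
    """For each table column, return the file column it reads from, or None."""
    ids = file_field_ids(file_columns, name_mapping)
    return {name: ids.get(fid) for name, fid in table_schema.items()}

# Table after "drop C, add C": the new C has field ID 3.
table_schema = {"A": 0, "B": 1, "C": 3}

# IcebergFile embeds its own IDs; HiveFile has names only.
iceberg_file = {"A": 0, "B": 1, "C": 2}
hive_file = {"A": None, "B": None, "C": None}

# Mapping that still points C at the dropped field 2 (old C is not remapped).
mapping_unchanged = {"A": 0, "B": 1, "C": 2}
# Mapping rewritten so that C points at the new field 3 (the suspected behavior).
mapping_rewritten = {"A": 0, "B": 1, "C": 3}

iceberg_result = project(table_schema, iceberg_file, mapping_rewritten)
hive_ok = project(table_schema, hive_file, mapping_unchanged)
hive_bad = project(table_schema, hive_file, mapping_rewritten)
```

In this model `iceberg_result` and `hive_ok` both leave the new C unresolved (`None`), while `hive_bad` resolves the new C to HiveFile's old C column, which is the incorrect read described above.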
