[GitHub] [iceberg] RussellSpitzer commented on issue #4930: Schema evolution on migrated Hive tables

GitBox Wed, 29 Jun 2022 12:01:11 -0700


RussellSpitzer commented on issue #4930:
URL: https://github.com/apache/iceberg/issues/4930#issuecomment-1170378592


   @karuppayya The issue here affects only files that do not contain field 
IDs. Such files are read using what we call a default name mapping, as 
specified at https://iceberg.apache.org/spec/#column-projection
   
   > Tables may also define a property schema.name-mapping.default with a JSON 
name mapping containing a list of field mapping objects. These mappings provide 
fallback field ids to be used when a data file does not contain field id 
information. Each object should contain
   > 
   > names: A required list of 0 or more names for a field.
   > field-id: An optional Iceberg field ID used when a field’s name is present 
in names
   > fields: An optional list of field mappings for child field of structs, 
maps, and lists.
   
   To walk through the whole scenario:
   
   Imagine I have a table with columns `A, B, C`, which Iceberg tracks 
internally as field IDs (0, 1, 2).
   
   I import a file (HiveFile) from Hive that contains `A, B, C`. This triggers 
(at least in some of our actions, like Migrate or Snapshot) the creation of a 
name mapping:
   ```
   {
     A -> 0,
     B -> 1,
     C -> 2
   }
   ```
   
   When reading HiveFile I see only column names; no field IDs are present in 
the file metadata. So I use the name mapping to decide which columns belong to 
which fields.
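
The fallback lookup for a file without field IDs can be sketched in Python. 
This is only an illustration, not Iceberg's reader code: the JSON follows the 
`names` / `field-id` shape from the spec excerpt quoted earlier, and 
`load_name_mapping` and `resolve_field_ids` are hypothetical helper names.

```python
import json

# Name mapping in the JSON shape described by the spec excerpt
# (the kind of value stored in schema.name-mapping.default).
NAME_MAPPING_JSON = """
[
  {"field-id": 0, "names": ["A"]},
  {"field-id": 1, "names": ["B"]},
  {"field-id": 2, "names": ["C"]}
]
"""


def load_name_mapping(text):
    """Flatten top-level mapping entries into {column name: field id}."""
    lookup = {}
    for entry in json.loads(text):
        field_id = entry.get("field-id")  # field-id is optional per the spec
        if field_id is None:
            continue
        for name in entry["names"]:
            lookup[name] = field_id
    return lookup


def resolve_field_ids(file_columns, lookup):
    """Assign fallback field IDs to the columns of a file that has none."""
    return {name: lookup.get(name) for name in file_columns}


mapping = load_name_mapping(NAME_MAPPING_JSON)
hive_file_ids = resolve_field_ids(["A", "B", "C"], mapping)
print(hive_file_ids)
```

A column with no entry in the mapping resolves to `None` here, meaning it 
cannot be tied to any table field. Nested `fields` entries are left out to 
keep the sketch short.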
   
   A file written by any Iceberg writer (IcebergFile) would have its own 
name-to-field-ID mapping embedded in the file itself:
   
   ```
   Footer {
     iceberg.schema {
       A -> 0, B -> 1, C -> 2
     }
   }
   ```
   
   Now when we drop C and then add C again, our new table still has the names 
`A, B, C`, but the fields are `A -> 0, B -> 1, C -> 3`.
   
   For IcebergFile we don't have a problem: we look at that file and see that 
it only has field "2", not "3", so it can't possibly have values for our new 
C (3).
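
To make the IcebergFile case concrete, here is a small Python sketch of 
projection by field ID. It is a simplified model, not the actual reader: 
`project_by_field_id` is a hypothetical function and the row values are made 
up for illustration.

```python
def project_by_field_id(table_schema, file_field_ids, row):
    """Read one row of a file that embeds its own field IDs.

    table_schema:   {column name: field id} for the current table
    file_field_ids: {column name in file: field id} from the file footer
    row:            {column name in file: value}
    """
    value_by_id = {file_field_ids[name]: value for name, value in row.items()}
    # A table field whose ID is absent from the file reads as None (null).
    return {name: value_by_id.get(fid) for name, fid in table_schema.items()}


# Table after dropping C and adding C again: same names, new ID for C.
table_schema = {"A": 0, "B": 1, "C": 3}

# IcebergFile was written before the drop, so its footer says C -> 2.
iceberg_file_ids = {"A": 0, "B": 1, "C": 2}
iceberg_row = {"A": "a1", "B": "b1", "C": "old-c"}

result = project_by_field_id(table_schema, iceberg_file_ids, iceberg_row)
print(result)
```

Because matching is done on IDs rather than names, the old C (2) values never 
surface under the new C (3); the new column reads as null for this file.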
   
   The table's name mapping should also have changed here. The original 
mapping "C -> 2" referred to the C that was dropped, so we should no longer 
map that column from the Hive file. But I believe our current behavior is to 
rewrite the name mapping so that it contains "C -> 3". This is incorrect, 
since HiveFile's old C values would then be read as the new C (3).
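
The difference between the two name mappings can be shown with a short Python 
sketch. It models the behavior as I understand it rather than the real code 
path: `read_without_field_ids` is a hypothetical function, the row values are 
invented, and the "expected" mapping (C simply left unmapped) is my assumption 
about what a fix would look like.

```python
def read_without_field_ids(table_schema, name_mapping, row):
    """Read a row from a file with no field IDs, using the name mapping."""
    value_by_id = {}
    for column, value in row.items():
        field_id = name_mapping.get(column)
        if field_id is not None:  # unmapped columns are ignored
            value_by_id[field_id] = value
    return {name: value_by_id.get(fid) for name, fid in table_schema.items()}


table_schema = {"A": 0, "B": 1, "C": 3}          # after drop C, add C
hive_row = {"A": "a1", "B": "b1", "C": "old-c"}  # written before the drop

# Suspected current behavior: the mapping is rewritten to follow the new C.
rewritten_mapping = {"A": 0, "B": 1, "C": 3}
# Assumed fix: the dropped column's name is not mapped to field 3.
expected_mapping = {"A": 0, "B": 1}

leaky = read_without_field_ids(table_schema, rewritten_mapping, hive_row)
clean = read_without_field_ids(table_schema, expected_mapping, hive_row)
print(leaky)
print(clean)
```

With the rewritten mapping, the dropped column's data reappears under the new 
C; with C left unmapped, HiveFile behaves like IcebergFile and the new C reads 
as null.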


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


