[GitHub] [iceberg] RussellSpitzer commented on issue #6122: IcebergGenerics.read(table) doesn't work as expected

via GitHub Sun, 07 May 2023 12:42:25 -0700


RussellSpitzer commented on issue #6122:
URL: https://github.com/apache/iceberg/issues/6122#issuecomment-1537525136


   > Do you think this should be considered as bug and it should be fixed in 
IcebergGenerics?
   The fact that the error message changed to a different column suggests that 
the mapping is in fact working but that the field IDs are wrong. I didn't know 
field mapping worked with generics, but the fact that there is a different 
mismatch suggests that it is being used. What you need to do is determine the 
correct field IDs for your schema and make sure the mapping matches them. If 
the field IDs are correct, then you could file a bug with a reproduction.
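   
   A rough sketch of that check, assuming the fallback mapping is supplied as 
JSON (the format used by the `schema.name-mapping.default` table property) and 
that the data files use the same column names as the current table schema. The 
sample schema, the sample mapping, and the helper `find_mismatches` are made up 
for illustration, and only top-level fields are compared:

```python
import json

# Hypothetical excerpt of the current schema as found in metadata.json.
table_schema = {
    "fields": [
        {"id": 1, "name": "id", "type": "long"},
        {"id": 2, "name": "data", "type": "string"},
    ]
}

# Hypothetical name mapping; "data" is deliberately given the wrong field ID.
name_mapping = json.loads(
    '[{"field-id": 1, "names": ["id"]}, {"field-id": 3, "names": ["data"]}]'
)


def find_mismatches(schema, mapping):
    """Return (name, mapped_id, schema_id) for every name whose IDs disagree."""
    ids_by_name = {field["name"]: field["id"] for field in schema["fields"]}
    problems = []
    for entry in mapping:
        for name in entry["names"]:
            schema_id = ids_by_name.get(name)  # None if the name is unknown
            if schema_id != entry["field-id"]:
                problems.append((name, entry["field-id"], schema_id))
    return problems


mismatches = find_mismatches(table_schema, name_mapping)
print(mismatches)
```

   Any entry it reports is a column whose mapped ID does not point at the 
column you intended, which is the kind of mismatch the error message hints at.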
   
   > I mean shouldn't any tool which is being used to read iceberg data be 
using fallback mechanism too if data files don't contain field-ids? I don't 
think the data files need to contain the field-ids. It's not because we didn't 
implement this. It's mainly because this information is anyway present in 
iceberg metadata json files. So same information doesn't need to be there in 2 
places. Please let me know about your thoughts.
   
   The information is not in two places. If you think the information is 
duplicated, you may have a misconception about how Iceberg handles schema 
evolution and schema in general.
   
   Imagine you have a table with a column X, then drop column X and add a new 
column X.
   
   In a Hive table, doing this would resurrect the data from the old column X 
because Hive uses *name mapping only*. Or imagine that you simply wanted to 
drop column X and add a new column X with a different type. Again, this is 
going to cause issues in Hive because it has no way of differentiating the old 
"x" from the new "x".
   
   To address issues like this, Iceberg instead has a mapping between the 
columns in files and the logical columns in the table.
   
   In Iceberg you end up with two different "column x"s, each with a different 
field ID. Files either have explicit field IDs written with their columns, or 
we need to supply a fallback mapping from those columns to field IDs. 
Remember, we now have two different X's in the table's history, and in almost 
all situations we do not expect the same data to be valid for both of them. 
This is why the field ID is required. The metadata.json only contains 
information about which field ID maps to which column; the actual column name 
in the schema isn't relevant. If we used the current schema names as the 
mapping, it would lead to resurrection and schema evolution issues (as noted 
above) whenever the schema changed.
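   
   A toy model of the drop-and-re-add case may make the difference concrete. 
This is plain Python, not Iceberg code; the dictionaries and the two reader 
functions are invented for illustration and only mimic name-based versus 
ID-based column resolution:

```python
# Table history: column "x" (field ID 1) is dropped, then a new "x" is added
# and receives a fresh field ID (2).
current_schema = {2: "x"}  # field ID -> column name in the table today

# A data file written before the drop; its column carries the old field ID.
old_file = {
    "columns": [
        {"field_id": 1, "name": "x", "values": ["stale-a", "stale-b"]},
    ]
}


def read_by_name(data_file, schema):
    """Name-only resolution: match file columns to table columns by name."""
    by_name = {col["name"]: col["values"] for col in data_file["columns"]}
    return {name: by_name.get(name) for name in schema.values()}


def read_by_id(data_file, schema):
    """ID resolution: match file columns to table columns by field ID."""
    by_id = {col["field_id"]: col["values"] for col in data_file["columns"]}
    return {name: by_id.get(field_id) for field_id, name in schema.items()}


resurrected = read_by_name(old_file, current_schema)
isolated = read_by_id(old_file, current_schema)
print(resurrected)
print(isolated)
```

   Name-only resolution hands the dropped column's values to the new "x", 
while ID resolution finds no column with ID 2 in the old file and treats the 
new "x" as missing there (shown as None in this sketch).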


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

