[GitHub] [iceberg] RussellSpitzer commented on issue #4930: Schema evolution on migrated Hive tables

GitBox Wed, 08 Jun 2022 09:20:02 -0700


RussellSpitzer commented on issue #4930:
URL: https://github.com/apache/iceberg/issues/4930#issuecomment-1150127669


   I believe the proper behavior should be that neither allows the old data to 
be read back. I wrote a repro of the current behavior:
   
   ```scala
   scala> spark.sql("CREATE external TABLE migratetest (foo int, bar int, zaz int) USING PARQUET LOCATION '/Users/russellspitzer/Temp/migratetest'").show
   
   scala> spark.sql("INSERT INTO migratetest (foo, bar , zaz) VALUES (1, 1, 1)")
   res7: org.apache.spark.sql.DataFrame = []
   
   scala> spark.sql("call spark_catalog.system.migrate('spark_catalog.default.migratetest')")
   res8: org.apache.spark.sql.DataFrame = [migrated_files_count: bigint]
   
   scala> spark.sql("SELECT * FROM migratetest").show
   +---+---+---+
   |foo|bar|zaz|
   +---+---+---+
   |  1|  1|  1|
   +---+---+---+
   
   scala> spark.sql("ALTER TABLE migratetest DROP COLUMN foo")
   res10: org.apache.spark.sql.DataFrame = []
   
   scala> spark.sql("SELECT * FROM migratetest").show
   +---+---+
   |bar|zaz|
   +---+---+
   |  1|  1|
   +---+---+
   
   scala> spark.sql("ALTER TABLE migratetest ADD COLUMN foo int")
   res12: org.apache.spark.sql.DataFrame = []
   
   scala> spark.sql("SELECT * FROM migratetest").show
   +---+---+---+
   |bar|zaz|foo|
   +---+---+---+
   |  1|  1|  1|
   +---+---+---+
   ```
   
   The issue here is that the default name mapping is changed when the second 
foo column is added, overriding the original name mapping.
   
   Name Mapping in original table : Foo maps to 1
   ```json
   [
     { "field-id" : 1, "names" : [ "foo" ] },
     { "field-id" : 2, "names" : [ "bar" ] },
     { "field-id" : 3, "names" : [ "zaz" ] }
   ]
   ```
   
   Name Mapping after dropping "foo" : Foo still maps to 1
   ```json
   [
     { "field-id" : 1, "names" : [ "foo" ] },
     { "field-id" : 2, "names" : [ "bar" ] },
     { "field-id" : 3, "names" : [ "zaz" ] }
   ]
   ```
   
   Name Mapping after adding "foo" back : Foo now maps to 4. *This is incorrect; 
we should not be changing the existing mapping.*
   ```json
   [
     { "field-id" : 1, "names" : [ ] },
     { "field-id" : 2, "names" : [ "bar" ] },
     { "field-id" : 3, "names" : [ "zaz" ] },
     { "field-id" : 4, "names" : [ "foo" ] }
   ]
   ```
   
   I'm on vacation now, so I'm not going to look into this more, but IMHO that 
final default name mapping should be identical to the one after dropping the 
column. So the error here is in the "ADD COLUMN" code.
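   
   To make the difference concrete, here is a rough sketch in plain Scala. It 
does not use any Iceberg classes: `MappedField`, `addColumnObserved`, 
`addColumnExpected`, and `resolve` are hypothetical names that only model the 
mappings shown above, under the assumption that a data file without field IDs 
resolves each column by looking its name up in the mapping.
   
   ```scala
   // Hypothetical model of one name-mapping entry; not Iceberg's actual classes.
   final case class MappedField(fieldId: Int, names: Set[String])

   object NameMappingSketch {
     // Observed in the repro: the new field ID takes over the name, and the
     // name is stripped from the entry that previously owned it.
     def addColumnObserved(mapping: List[MappedField], name: String, newId: Int): List[MappedField] =
       mapping.map(f => f.copy(names = f.names - name)) :+ MappedField(newId, Set(name))

     // Proposed above: if the name is already mapped, leave the mapping as is,
     // so it stays identical to the mapping after the drop.
     def addColumnExpected(mapping: List[MappedField], name: String, newId: Int): List[MappedField] =
       if (mapping.exists(_.names.contains(name))) mapping
       else mapping :+ MappedField(newId, Set(name))

     // Which field ID does a column name in an ID-less data file resolve to?
     def resolve(mapping: List[MappedField], name: String): Option[Int] =
       mapping.find(_.names.contains(name)).map(_.fieldId)

     // Mapping after "DROP COLUMN foo", as shown in the second JSON block.
     val afterDrop: List[MappedField] = List(
       MappedField(1, Set("foo")),
       MappedField(2, Set("bar")),
       MappedField(3, Set("zaz"))
     )

     def main(args: Array[String]): Unit = {
       // Observed: "foo" in the old file resolves to the new ID 4, so old data reappears.
       println(resolve(addColumnObserved(afterDrop, "foo", 4), "foo"))
       // Proposed: "foo" in the old file still resolves to the dropped ID 1.
       println(resolve(addColumnExpected(afterDrop, "foo", 4), "foo"))
     }
   }
   ```
   
   With the proposed behavior the old file's `foo` keeps pointing at the dropped 
field ID 1, so the re-added column (ID 4) would have no data in the migrated 
file instead of resurfacing the old values.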
   

