[GitHub] [spark] attilapiros commented on a change in pull request #31133: [SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables

GitBox Tue, 12 Jan 2021 00:32:14 -0800


attilapiros commented on a change in pull request #31133:
URL: https://github.com/apache/spark/pull/31133#discussion_r555592091




##########
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
##########
@@ -1883,6 +1883,60 @@ class HiveDDLSuite
     }
   }
 
+  test("SPARK-26836: support Avro schema evolution") {
+    withTable("t") {
+      val originalSchema =
+        """
+          |{
+          |  "namespace": "test",
+          |  "name": "some_schema",
+          |  "type": "record",
+          |  "fields": [
+          |    {
+          |      "name": "col2",

Review comment:
       Yes, under the schema evolution rules you can add a field at an 
arbitrary position; see the first example here:
   
https://docs.oracle.com/database/nosql-12.1.3.0/GettingStartedGuide/schemaevolution.html
    
   My intention with this example and its field naming was to illustrate the worst 
case (the column mismatch error) and to emphasize its root cause. Even when new 
fields are added at the end, that only reduces the scope of the problem, as 
the new fields will still have wrong values (null). 
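   
   To make the rule concrete, here is a minimal sketch in plain Scala of name-based resolution, where a field missing from the written data falls back to its default, next to naive positional pairing. It is only a simplified model under my own assumptions: `Field`, `resolveByName` and `resolveByPosition` are hypothetical names, not the Avro library API and not the code path touched by this PR.
   
   ```scala
   // Simplified model of Avro-style schema resolution (illustrative only, no Avro/Spark dependency).
   // A reader field has a name and an optional default value.
   final case class Field(name: String, default: Option[String] = None)

   object SchemaResolutionSketch {
     // Resolve a written record against the reader schema by field NAME:
     // use the written value if present, else the reader default, else report an error.
     // Written fields that the reader schema does not mention are ignored.
     def resolveByName(written: Map[String, String], reader: Seq[Field]): Either[String, Seq[String]] = {
       val resolved = reader.map { f =>
         written.get(f.name).orElse(f.default).toRight(s"no value or default for field '${f.name}'")
       }
       resolved.collectFirst { case Left(err) => err } match {
         case Some(err) => Left(err)
         case None      => Right(resolved.collect { case Right(v) => v })
       }
     }

     // Naive pairing by POSITION: the i-th written value lands in the i-th reader field,
     // whatever the field names are. Missing positions become None.
     def resolveByPosition(writtenValues: Seq[String], reader: Seq[Field]): Seq[Option[String]] =
       reader.indices.map(i => writtenValues.lift(i))

     def main(args: Array[String]): Unit = {
       // Data written with the old schema (col1, col2).
       val written = Map("col1" -> "col1_value", "col2" -> "col2_value")
       // Hypothetical evolved schema: col3 (with a default) inserted in the middle.
       val evolved = Seq(Field("col1"), Field("col3", Some("col3_default")), Field("col2"))

       println(resolveByName(written, evolved))
       println(resolveByPosition(Seq("col1_value", "col2_value"), evolved))
     }
   }
   ```
   
   In this model, resolving by name gives `col3` its default and keeps `col2_value` under `col2`, whereas pairing by position puts `col2_value` under `col3` and leaves `col2` without a value.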
   
   Moreover, when an existing field is removed, the column mismatch usually 
cannot be avoided. For example:
   
   ```
   sql("""
     CREATE TABLE t PARTITIONED BY (ds string)
     ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
     WITH SERDEPROPERTIES ('avro.schema.literal'='
     {
       "namespace": "test",
       "name": "some_schema",
       "type": "record",
       "fields": [
         {
           "name": "col1",
           "type": "string",
           "default": "col1_default"
         },
         {
           "name": "col2",
           "type": "string"
         }
       ]
     }')
     STORED AS
     INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
     OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
   """)
   
   sql("""
     INSERT INTO t partition (ds='1981-01-07') VALUES ('col1_value', 'col2_value')
   """)
   
   sql("""
     ALTER TABLE t SET SERDEPROPERTIES ('avro.schema.literal'='
     {
       "namespace": "test",
       "name": "some_schema",
       "type": "record",
       "fields": [
         {
           "name": "col2",
           "type": "string"
         }
       ]
     }')
   """)
   
   sql("""
     INSERT INTO t partition (ds='1983-04-27') VALUES ('col2_value')
   """)
   
   sql("""
     select * from t
   """).show()
   ```
   
   Without this PR:
   ```
   +------------+----------+
   |        col2|        ds|
   +------------+----------+
   |col1_default|1981-01-07|
   |  col2_value|1983-04-27|
   +------------+----------+
   ```
   
   With the fix:
   ```
   +----------+----------+
   |      col2|        ds|
   +----------+----------+
   |col2_value|1981-01-07|
   |col2_value|1983-04-27|
   +----------+----------+
   ```
   
   
   



