attilapiros commented on a change in pull request #31133:
URL: https://github.com/apache/spark/pull/31133#discussion_r555592091
##########
File path:
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
##########
@@ -1883,6 +1883,60 @@ class HiveDDLSuite
}
}
+ test("SPARK-26836: support Avro schema evolution") {
+ withTable("t") {
+ val originalSchema =
+ """
+ |{
+ | "namespace": "test",
+ | "name": "some_schema",
+ | "type": "record",
+ | "fields": [
+ | {
+ | "name": "col2",
Review comment:
Yes. Under the schema evolution rules you can add a field at an
arbitrary position; see the first example here:
https://docs.oracle.com/database/nosql-12.1.3.0/GettingStartedGuide/schemaevolution.html
My intention with this example and its field naming was to illustrate the worst
case (the column mismatch error) and to emphasize its root cause. Even when new
fields are added only at the end, that merely reduces the scope of the problem:
the new fields will still get wrong values (null).
Moreover, when an existing field is removed, the column mismatch usually
cannot be avoided. For example:
```
sql("""
CREATE TABLE t PARTITIONED BY (ds string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='
{
"namespace": "test",
"name": "some_schema",
"type": "record",
"fields": [
{
"name": "col1",
"type": "string",
"default": "col1_default"
},
{
"name": "col2",
"type": "string"
}
]
}')
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
""")
sql("""
INSERT INTO t partition (ds='1981-01-07') VALUES ('col1_value',
'col2_value')
""")
sql("""
ALTER TABLE t SET SERDEPROPERTIES ('avro.schema.literal'='
{
"namespace": "test",
"name": "some_schema",
"type": "record",
"fields": [
{
"name": "col2",
"type": "string"
}
]
}')
""")
sql("""
INSERT INTO t partition (ds='1983-04-27') VALUES ('col2_value')
""")
sql("""
select * from t
""").show()
```
Without this PR:
```
+------------+----------+
| col2| ds|
+------------+----------+
|col1_default|1981-01-07|
| col2_value|1983-04-27|
+------------+----------+
```
With the fix:
```
+----------+----------+
| col2| ds|
+----------+----------+
|col2_value|1981-01-07|
|col2_value|1983-04-27|
+----------+----------+
```
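To make the difference between the two outputs more concrete, here is a toy model of matching stored fields to reader columns by position versus by name. It is only a sketch and not the code path Spark or Hive actually takes: `ColumnResolutionSketch`, `byPosition`, `byName` and the values `a` and `b` are hypothetical names chosen for illustration.

```scala
// Toy model only: not Spark or Hive code. All names here are hypothetical.
object ColumnResolutionSketch {
  // A row as written under the old schema: ordered (fieldName, value) pairs.
  type Row = Seq[(String, String)]

  // Positional resolution: the i-th reader column takes the i-th stored value,
  // regardless of what the stored field was called.
  def byPosition(row: Row, readerColumns: Seq[String]): Seq[(String, String)] =
    readerColumns.zip(row.map(_._2))

  // Name-based resolution: each reader column looks up the stored field
  // with the same name (None if the old data has no such field).
  def byName(row: Row, readerColumns: Seq[String]): Seq[(String, Option[String])] = {
    val fields = row.toMap
    readerColumns.map(c => c -> fields.get(c))
  }

  def main(args: Array[String]): Unit = {
    val oldRow: Row = Seq("col1" -> "a", "col2" -> "b") // written with col1 and col2
    val newColumns = Seq("col2")                        // col1 removed afterwards

    println(byPosition(oldRow, newColumns)) // List((col2,a))
    println(byName(oldRow, newColumns))     // List((col2,Some(b)))
  }
}
```

In this model, removing the first field shifts every remaining column under positional matching, which is why appending new fields only at the end does not help once a field is dropped.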