attilapiros opened a new pull request #31133: URL: https://github.com/apache/spark/pull/31133
### What changes were proposed in this pull request? Before this PR for a partitioned Avro Hive table when the SerDe is configured to read the partition data the table level properties were overwritten by the partition level properties. This PR changes this ordering by giving table level properties higher precedence thus when a new evolved schema is set for the table this new schema will be used to read the partition data and not the original schema which was used for writing the data. This new behavior is consistent with Apache Hive. See the example used in the unit test `SPARK-26836: support Avro schema evolution`, in Hive this results in: ``` 0: jdbc:hive2://<IP>:10000> select * from t; INFO : Compiling command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): select * from t INFO : Semantic Analysis Completed INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:t.col1, type:string, comment:null), FieldSchema(name:t.col2, type:string, comment:null), FieldSchema(name:t.ds, type:string, comment:null)], properties:null) INFO : Completed compiling command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time taken: 0.098 seconds INFO : Executing command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): select * from t INFO : Completed executing command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time taken: 0.013 seconds INFO : OK +---------------+-------------+-------------+ | t.col1 | t.col2 | t.ds | +---------------+-------------+-------------+ | col1_default | col2_value | 1981-01-07 | | col1_value | col2_value | 1983-04-27 | +---------------+-------------+-------------+ 2 rows selected (0.159 seconds) ``` ### Why are the changes needed? Without this change the old schema would be used. This can use a correctness issue when the new schema introduces a new field with a default value (following the rules of schema evolution) before an existing field. In this case the rows coming from the partition where the old schema was used will **contain values in wrong column positions**. For example check the attached unit test `SPARK-26836: support Avro schema evolution` Without this fix the result of the select on the table would be: ``` +----------+----------+----------+ | col1| col2| ds| +----------+----------+----------+ |col2_value| null|1981-01-07| |col1_value|col2_value|1983-04-27| +----------+----------+----------+ ``` With this fix: ``` +------------+----------+----------+ | col1| col2| ds| +------------+----------+----------+ |col1_default|col2_value|1981-01-07| | col1_value|col2_value|1983-04-27| +------------+----------+----------+ ``` ### Does this PR introduce _any_ user-facing change? Only when the new SQL configuration `spark.sql.hive.avroSchemaEvolution.enabled` is set to `true`. By default this configuration is set to `false`. The reason of introducing this configuration (and have it switched off by default) is to be consistent with the old results for cases when no new field was introduced by the new schema (for example when just the default value of a field was changed). ### How was this patch tested? This was tested with the unit tested included to the PR. And manually on Apache Spark / Hive. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
