[GitHub] [spark] attilapiros opened a new pull request #31133: [SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables

GitBox Mon, 11 Jan 2021 06:34:46 -0800


attilapiros opened a new pull request #31133:
URL: https://github.com/apache/spark/pull/31133



   ### What changes were proposed in this pull request?
   
   Before this PR for a partitioned Avro Hive table when the SerDe is 
configured to read the partition data
   the table level properties were overwritten by the partition level 
properties.
   
   This PR changes this ordering by giving table level properties higher 
precedence  thus when a new evolved schema 
   is set for the table this new schema will be used to read the partition data 
and not the original schema which was used for writing the data.
   
   This new behavior is consistent with Apache Hive. 
   See the example used in the unit test `SPARK-26836: support Avro schema 
evolution`, in Hive this results in: 
   
   ```
   0: jdbc:hive2://<IP>:10000> select * from t;
   INFO  : Compiling 
command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): 
select * from t
   INFO  : Semantic Analysis Completed
   INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:t.col1, 
type:string, comment:null), FieldSchema(name:t.col2, type:string, 
comment:null), FieldSchema(name:t.ds, type:string, comment:null)], 
properties:null)
   INFO  : Completed compiling 
command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time 
taken: 0.098 seconds
   INFO  : Executing 
command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): 
select * from t
   INFO  : Completed executing 
command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time 
taken: 0.013 seconds
   INFO  : OK
   +---------------+-------------+-------------+
   |    t.col1     |   t.col2    |    t.ds     |
   +---------------+-------------+-------------+
   | col1_default  | col2_value  | 1981-01-07  |
   | col1_value    | col2_value  | 1983-04-27  |
   +---------------+-------------+-------------+
   2 rows selected (0.159 seconds)
   ```
   
   ### Why are the changes needed?
   
   Without this change the old schema would be used. This can use a correctness 
issue when the new schema introduces 
   a new field with a default value (following the rules of schema evolution) 
before an existing field. 
   In this case the rows coming from the partition where the old schema was 
used will **contain values in wrong column positions**.
   
   For example check the attached unit test `SPARK-26836: support Avro schema 
evolution` 
   
   Without this fix the result of the select on the table would be:
   
   ```
   +----------+----------+----------+
   |      col1|      col2|        ds|
   +----------+----------+----------+
   |col2_value|      null|1981-01-07|
   |col1_value|col2_value|1983-04-27|
   +----------+----------+----------+
   
   ```
   
   With this fix:
   
   ```
   +------------+----------+----------+
   |        col1|      col2|        ds|
   +------------+----------+----------+
   |col1_default|col2_value|1981-01-07|
   |  col1_value|col2_value|1983-04-27|
   +------------+----------+----------+
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   Only when the new SQL configuration 
`spark.sql.hive.avroSchemaEvolution.enabled` is set to `true`.
   By default this configuration is set to `false`.
   
   The reason of introducing this configuration (and have it switched off by 
default) is to be consistent with the old results 
   for cases when no new field was introduced by the new schema (for example 
when just the default value of a field was changed).
   
   ### How was this patch tested?
   
   This was tested with the unit tested included to the PR. 
   And manually on Apache Spark / Hive.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] attilapiros opened a new pull request #31133: [SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables

Reply via email to