[
https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261398#comment-17261398
]
Attila Zsolt Piros edited comment on SPARK-26836 at 1/8/21, 4:15 PM:
---------------------------------------------------------------------
I am working on this and a PR can be expected this weekend / next week
was (Author: attilapiros):
I am working on a this
> Columns get switched in Spark SQL using Avro backed Hive table if schema
> evolves
> --------------------------------------------------------------------------------
>
> Key: SPARK-26836
> URL: https://issues.apache.org/jira/browse/SPARK-26836
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1, 2.4.0
> Environment: I tested with Hive and HCatalog which runs on version
> 2.3.4 and with Spark 2.3.1 and 2.4
> Reporter: Tamás Németh
> Priority: Critical
> Labels: correctness
> Attachments: doctors.avro, doctors_evolved.avro,
> doctors_evolved.json, original.avsc
>
>
> I have a hive avro table where the avro schema is stored on s3 next to the
> avro files.
> In the table definiton the avro.schema.url always points to the latest
> partition's _schema.avsc file which is always the lates schema. (Avro schemas
> are backward and forward compatible in a table)
> When new data comes in, I always add a new partition where the
> avro.schema.url properties also set to the _schema.avsc which was used when
> it was added and of course I always update the table avro.schema.url property
> to the latest one.
> Querying this table works fine until the schema evolves in a way that a new
> optional property is added in the middle.
> When this happens then after the spark sql query the columns in the old
> partition gets mixed up and it shows the wrong data for the columns.
> If I query the table with Hive then everything is perfectly fine and it gives
> me back the correct columns for the partitions which were created the old
> schema and for the new which was created the evolved schema.
>
> Here is how I could reproduce with the
> [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro]
> example data in sql test suite.
> # I have created two partition folder:
> {code:java}
> [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors
> /dt=2019-02-05/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-05/doctors.avro
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors
> /dt=2019-02-06/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-06/doctors_evolved.avro{code}
> Here the first partition had data which was created with the schema before
> evolving and the second one had the evolved one. (the evolved schema is the
> same as in your testcase except I moved the extra_field column to the last
> from the second and I generated two lines of avro data with the evolved
> schema.
> # I have created a hive table with the following command:
>
> {code:java}
> CREATE EXTERNAL TABLE `default.doctors`
> PARTITIONED BY (
> `dt` string
> )
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> WITH SERDEPROPERTIES (
> 'avro.schema.url'='s3://somelocation/doctors/
> /dt=2019-02-06/_schema.avsc')
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> LOCATION
> 's3://somelocation/doctors/'
> TBLPROPERTIES (
> 'transient_lastDdlTime'='1538130975'){code}
>
> Here as you can see the table schema url points to the latest schema
> 3. I ran an msck _repair table_ to pick up all the partitions.
> Fyi: If I run my select * query from here then everything is fine and no
> columns switch happening.
> 4. Then I changed the first partition's avro.schema.url url to points to the
> schema which is under the partition folder (non-evolved one ->
> s3://somelocation/doctors/
> /dt=2019-02-05/_schema.avsc)
> Then if you ran a _select * from default.spark_test_ then the columns will be
> mixed up (on the data below the first name column becomes the extra_field
> column. I guess because in the latest schema it is the second column):
>
> {code:java}
> number,extra_field,first_name,last_name,dt
> 6,Colin,Baker,null,2019-02-05
> 3,Jon,Pertwee,null,2019-02-05
> 4,Tom,Baker,null,2019-02-05
> 5,Peter,Davison,null,2019-02-05
> 11,Matt,Smith,null,2019-02-05
> 1,William,Hartnell,null,2019-02-05
> 7,Sylvester,McCoy,null,2019-02-05
> 8,Paul,McGann,null,2019-02-05
> 2,Patrick,Troughton,null,2019-02-05
> 9,Christopher,Eccleston,null,2019-02-05
> 10,David,Tennant,null,2019-02-05
> 21,fishfinger,Jim,Baker,2019-02-06
> 24,fishfinger,Bean,Pertwee,2019-02-06
> {code}
> If I try the same query from Hive and not from spark sql then everything is
> fine and it never switches the columns.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]