[ 
https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26836:
---------------------------------
    Priority: Critical  (was: Blocker)

> Columns get switched in Spark SQL using Avro backed Hive table if schema 
> evolves
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-26836
>                 URL: https://issues.apache.org/jira/browse/SPARK-26836
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1, 2.4.0
>         Environment: I tested with Hive and HCatalog which runs on version 
> 2.3.4 and with Spark 2.3.1 and 2.4
>            Reporter: Tamás Németh
>            Priority: Critical
>              Labels: correctness
>         Attachments: doctors.avro, doctors_evolved.avro, 
> doctors_evolved.json, original.avsc
>
>
> I have a hive avro table where the avro schema is stored on s3 next to the 
> avro files. 
> In the table definiton the avro.schema.url always points to the latest 
> partition's _schema.avsc file which is always the lates schema. (Avro schemas 
> are backward and forward compatible in a table)
> When new data comes in, I always add a new partition where the 
> avro.schema.url properties also set to the _schema.avsc which was used when 
> it was added and of course I always update the table avro.schema.url property 
> to the latest one.
> Querying this table works fine until the schema evolves in a way that a new 
> optional property is added in the middle. 
> When this happens then after the spark sql query the columns in the old 
> partition gets mixed up and it shows the wrong data for the columns.
> If I query the table with Hive then everything is perfectly fine and it gives 
> me back the correct columns for the partitions which were created the old 
> schema and for the new which was created the evolved schema.
>  
> Here is how I could reproduce with the 
> [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro]
>  example data in sql test suite.
>  # I have created two partition folder:
> {code:java}
> [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors
> /dt=2019-02-05/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-05/doctors.avro
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors
> /dt=2019-02-06/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-06/doctors_evolved.avro{code}
> Here the first partition had data which was created with the schema before 
> evolving and the second one had the evolved one. (the evolved schema is the 
> same as in your testcase except I moved the extra_field column to the last 
> from the second and I generated two lines of avro data with the evolved 
> schema.
>  # I have created a hive table with the following command:
>  
> {code:java}
> CREATE EXTERNAL TABLE `default.doctors`
>  PARTITIONED BY (
>  `dt` string
>  )
>  ROW FORMAT SERDE
>  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>  WITH SERDEPROPERTIES (
>  'avro.schema.url'='s3://somelocation/doctors/
> /dt=2019-02-06/_schema.avsc')
>  STORED AS INPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>  OUTPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>  LOCATION
>  's3://somelocation/doctors/'
>  TBLPROPERTIES (
>  'transient_lastDdlTime'='1538130975'){code}
>  
> Here as you can see the table schema url points to the latest schema
> 3. I ran an msck _repair table_ to pick up all the partitions.
> Fyi: If I run my select * query from here then everything is fine and no 
> columns switch happening.
> 4. Then I changed the first partition's avro.schema.url url to points to the 
> schema which is under the partition folder (non-evolved one -> 
> s3://somelocation/doctors/
> /dt=2019-02-05/_schema.avsc)
> Then if you ran a _select * from default.spark_test_ then the columns will be 
> mixed up (on the data below the first name column becomes the extra_field 
> column. I guess because in the latest schema it is the second column):
>  
> {code:java}
> number,extra_field,first_name,last_name,dt 
> 6,Colin,Baker,null,2019-02-05 
> 3,Jon,Pertwee,null,2019-02-05 
> 4,Tom,Baker,null,2019-02-05 
> 5,Peter,Davison,null,2019-02-05 
> 11,Matt,Smith,null,2019-02-05 
> 1,William,Hartnell,null,2019-02-05 
> 7,Sylvester,McCoy,null,2019-02-05 
> 8,Paul,McGann,null,2019-02-05 
> 2,Patrick,Troughton,null,2019-02-05 
> 9,Christopher,Eccleston,null,2019-02-05 
> 10,David,Tennant,null,2019-02-05 
> 21,fishfinger,Jim,Baker,2019-02-06 
> 24,fishfinger,Bean,Pertwee,2019-02-06
> {code}
> If I try the same query from Hive and not from spark sql then everything is 
> fine and it never switches the columns.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to