sharad Gupta created SPARK-31574:
------------------------------------
Summary: Schema evolution in spark while using the storage format
as parquet
Key: SPARK-31574
URL: https://issues.apache.org/jira/browse/SPARK-31574
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0
Reporter: sharad Gupta
Hi Team,
Use case:
Suppose there is a table T1 with column C1 with datatype as int in schema
version 1. In the first on boarding table T1. I wrote couple of parquet files
with this schema version 1 with underlying file format used parquet.
Now in schema version 2 the C1 column datatype changed to string from int. Now
It will write data with schema version 2 in parquet.
So some parquet files are written with schema version 1 and some written with
schema version 2.
Problem statement :
1. We are not able to execute the below command from spark sql
```Alter table Table T1 change C1 C1 string```
2. So as a solution i goto hive and alter the table change datatype because it
supported in hive then try to read the data in spark. So it is giving me error
```
Caused by: java.lang.UnsupportedOperationException:
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
at
org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:372)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)```
3. Suspecting that the underlying parquet file is written with integer type and
we are reading from a table whose column is changed to string type. So that is
why it is happening.
How you can reproduce this:
spark sql
1. Create a table from spark sql with one column with datatype as int with
stored as parquet.
2. Now put some data into table.
3. Now you can see the data if you select from table.
Hive
1. change datatype from int to string by alter command
2. Now try to read data, You will be able to read the data here even after
changing the datatype.
spark sql
1. Try to read data from here now you will see the error.
Now the question is how to solve schema evolution in spark while using the
storage format as parquet.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]