sharad Gupta created SPARK-31574:
------------------------------------

             Summary: Schema evolution in spark while using the storage format 
as parquet
                 Key: SPARK-31574
                 URL: https://issues.apache.org/jira/browse/SPARK-31574
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: sharad Gupta


Hi Team,

 

Use case:

Table T1 has a column C1 with datatype int in schema version 1. During the initial onboarding of T1, a couple of Parquet files were written under schema version 1.

In schema version 2, C1's datatype changed from int to string, and subsequent data was written under schema version 2, also as Parquet.

As a result, some of T1's Parquet files were written with schema version 1 (C1 as int) and some with schema version 2 (C1 as string).

Problem statement:

1. The command below cannot be executed from Spark SQL:
```ALTER TABLE T1 CHANGE C1 C1 string```

2. As a workaround I went to Hive and altered the column's datatype there, because Hive supports it, and then tried to read the data back in Spark. That read fails with the error below:



```
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
  at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
  at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:372)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
```

 

3. The suspected cause: the underlying Parquet files were written with an integer column, but the table's column has been changed to string, so Spark's vectorized Parquet reader cannot decode the integer dictionary into strings.
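One way to check this suspicion is to query an old file directly; Spark SQL's `parquet.` path syntax infers the schema from the file footer rather than from the metastore, so it shows what the file actually contains. This is a sketch only, and the file path below is hypothetical:

```sql
-- Hypothetical path: point this at one of the files written under schema
-- version 1. The footer schema, not the altered table schema, is used here,
-- so C1 should come back as int, confirming the mismatch.
SELECT * FROM parquet.`/user/hive/warehouse/t1/part-00000.parquet`;
```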

How to reproduce this:

Spark SQL
1. Create a table with one column of datatype int, stored as Parquet.
2. Insert some data into the table.
3. A SELECT on the table returns the data as expected.

Hive
1. Change the column's datatype from int to string with an ALTER command.
2. Read the data again; Hive can still read it even after the datatype change.

Spark SQL
1. Read the data again; this time the error above is raised.
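The steps above can be condensed into the following sketch (table, column, and values are illustrative):

```sql
-- In spark-sql:
CREATE TABLE T1 (C1 int) STORED AS PARQUET;
INSERT INTO T1 VALUES (1), (2);
SELECT * FROM T1;   -- works: every file agrees that C1 is int

-- In the Hive CLI (Spark 2.3 rejects this ALTER):
ALTER TABLE T1 CHANGE C1 C1 string;
SELECT * FROM T1;   -- Hive converts int to string on read

-- Back in spark-sql:
SELECT * FROM T1;   -- fails with the UnsupportedOperationException above
```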

Now the question is how to handle schema evolution in Spark when the storage format is Parquet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
