[jira] [Created] (SPARK-31574) Schema evolution in spark while using the storage format as parquet

sharad Gupta (Jira) Mon, 27 Apr 2020 01:09:46 -0700

sharad Gupta created SPARK-31574:
------------------------------------

             Summary: Schema evolution in spark while using the storage format 
as parquet
                 Key: SPARK-31574
                 URL: https://issues.apache.org/jira/browse/SPARK-31574
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: sharad Gupta

Hi Team,

Use case:

Suppose table T1 has a column C1 of datatype int in schema version 1. When T1
was first onboarded, I wrote a couple of Parquet files with schema version 1.

In schema version 2, the datatype of C1 changed from int to string, so new data
is written to Parquet with schema version 2.

As a result, some Parquet files are written with schema version 1 and some with
schema version 2.

Problem statement:

1. We are not able to execute the command below from Spark SQL:
```ALTER TABLE T1 CHANGE C1 C1 string```

2. As a workaround, I went to Hive and altered the column's datatype there,
since Hive supports this, and then tried to read the data in Spark. Spark gives
me this error:

```
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
    at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
    at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:372)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```

3. I suspect this happens because the underlying Parquet files were written
with an integer type, while we are reading them through a table whose column
has been changed to string type.

How to reproduce this:

Spark SQL
1. Create a table with one column of datatype int, stored as Parquet.
2. Insert some data into the table.
3. Select from the table; the data is visible.

Hive
1. Change the column's datatype from int to string with an ALTER command.
2. Read the data. It is still readable here, even after the datatype change.

Spark SQL
1. Read the data again. This time you will see the error.
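
The reproduction steps can be written out roughly as the statements below. This
is only a sketch: the lowercase names t1 and c1 stand in for T1 and C1, the
inserted values are made up, and the exact exception may vary with the data and
the Spark configuration.

```sql
-- Spark SQL (reported against 2.3.0): create and populate the table.
-- The names t1/c1 and the sample values are illustrative assumptions.
CREATE TABLE t1 (c1 INT) STORED AS PARQUET;
INSERT INTO t1 VALUES (1), (2), (3);
SELECT * FROM t1;   -- readable at this point

-- Hive: change the column type in the metastore only.
-- The Parquet files already on disk are not rewritten.
ALTER TABLE t1 CHANGE c1 c1 STRING;
SELECT * FROM t1;   -- still readable in Hive, per the report

-- Spark SQL again: the table says STRING, the files still hold INT32 values.
SELECT * FROM t1;   -- per the report, fails with UnsupportedOperationException
```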

Now the question is: how do we handle schema evolution in Spark when the
storage format is Parquet?
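
One possible direction, offered as a sketch and not as a confirmed fix, is to
rewrite the version-1 data so that every file physically stores C1 as a string.
The table name t1_v2 is hypothetical, and the sketch assumes t1 is still
declared with an int column and contains only version-1 files, so that Spark
can read it without the type mismatch.

```sql
-- Sketch only: t1_v2 is a hypothetical name, not taken from the report.
-- Assumes t1 is still declared as c1 INT and holds only version-1 files.
CREATE TABLE t1_v2 (c1 STRING) STORED AS PARQUET;

-- Cast while copying, so the new files store c1 as a string on disk.
INSERT INTO t1_v2 SELECT CAST(c1 AS STRING) FROM t1;

-- Version-2 data can then be appended to t1_v2 under the same schema.
SELECT * FROM t1_v2;
```

The trade-off is a one-time copy of the old data in exchange for files whose
on-disk type matches the table definition.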
