[
https://issues.apache.org/jira/browse/SPARK-31751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-31751:
---------------------------------
Flags: (was: Important)
> spark serde property path overwrites table property location
> ------------------------------------------------------------
>
> Key: SPARK-31751
> URL: https://issues.apache.org/jira/browse/SPARK-31751
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1, 2.4.5
> Reporter: Nithin
> Priority: Major
>
> This is an issue that has caused us many data errors.
> 1) using spark ( with hive context enabled )
> df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
> df.write.format("orc").option("compression", "ZLIB").mode("overwrite").saveAsTable('test_spark')
>
> 2) from hive
> alter table test_spark rename to test_spark2
>
> 3) from the spark-sql command line ( note : not pyspark or spark-shell )
> select * from test_spark2
>
> will give the output
> NULL NULL NULL
> Time taken: 0.334 seconds, Fetched 1 row(s)
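>
> For diagnosis, the stale entry can be seen in the table metadata. A sketch
> from the hive or spark-sql CLI ( the warehouse path below is only an
> illustration and will differ per cluster ):
>
> DESCRIBE FORMATTED test_spark2;
> -- under "Storage Desc Params" ( hive ) / "Storage Properties" ( spark-sql ),
> -- the serde entry path=hdfs://.../warehouse/test_spark still names the old
> -- table, while the metastore Location already points at .../test_spark2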
>
> This returns NULLs because the pyspark write API adds a serde property
> called path to the table definition in the hive metastore. When hive renames
> the table, it does not understand this property and leaves it unchanged,
> still pointing at the old location. When spark-sql then reads the table, it
> honors the serde property first and tries to read from the non-existent hdfs
> location. Failing with an error would at least have been acceptable, but
> silently returning NULL causes applications to fail badly. Since Spark
> claims to support hive tables, it should respect the hive metastore location
> property rather than its own path serde property when reading a table. This
> cannot be classified as expected behaviour.
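>
> A possible stopgap on the user side ( a sketch only, assuming a managed
> table whose data directory was moved by the hive rename; the hdfs path is a
> placeholder that must match the table's actual new location ) is to repoint
> the stale serde property from hive:
>
> ALTER TABLE test_spark2 SET SERDEPROPERTIES (
>   'path'='hdfs://namenode/user/hive/warehouse/test_spark2'
> );
> -- after this, spark-sql reads the renamed table's data instead of NULLs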
--
This message was sent by Atlassian Jira
(v8.3.4#803005)