[
https://issues.apache.org/jira/browse/SPARK-16552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wenchen Fan resolved SPARK-16552.
---------------------------------
Resolution: Fixed
Fix Version/s: 2.1.0
Issue resolved by pull request 14207
[https://github.com/apache/spark/pull/14207]
> Store the Inferred Schemas into External Catalog Tables when Creating Tables
> ----------------------------------------------------------------------------
>
> Key: SPARK-16552
> URL: https://issues.apache.org/jira/browse/SPARK-16552
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Xiao Li
> Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> Currently, in Spark SQL, the initial creation of a table's schema can be classified
> into two groups. This applies to both Hive tables and data source tables:
> Group A. Users specify the schema.
> Case 1 CREATE TABLE AS SELECT: the schema is determined by the result schema
> of the SELECT clause. For example,
> {noformat}
> CREATE TABLE tab STORED AS TEXTFILE
> AS SELECT * FROM input
> {noformat}
> Case 2 CREATE TABLE: users explicitly specify the schema. For example,
> {noformat}
> CREATE TABLE jsonTable (_1 string, _2 string)
> USING org.apache.spark.sql.json
> {noformat}
> Group B. Spark SQL infers the schema at runtime.
> Case 3 CREATE TABLE: users do not specify the schema, only the path to the file
> location. For example,
> {noformat}
> CREATE TABLE jsonTable
> USING org.apache.spark.sql.json
> OPTIONS (path '${tempDir.getCanonicalPath}')
> {noformat}
> Currently, Spark SQL does not store the inferred schema in the external catalog
> for the cases in Group B. When users refresh the metadata cache, or access the
> table for the first time after (re-)starting Spark, Spark SQL infers the schema
> and stores it in the metadata cache to improve the performance of subsequent
> metadata requests. However, this runtime schema inference can cause undesirable
> schema changes after each restart of Spark.
> It is desirable to store the inferred schema in the external catalog when
> creating the table. When users intend to refresh the schema, they issue
> `REFRESH TABLE`. Spark SQL will infer the schema again based on the
> previously specified table location and update/refresh the schema in the
> external catalog and metadata cache.
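> As an illustrative sketch of the intended workflow (the table name and path
> below are hypothetical, not taken from the patch), the schema would be inferred
> once at creation time and persisted, and re-inferred only on an explicit
> REFRESH TABLE:
> {noformat}
> -- Schema is inferred from the files at the given path and stored in the
> -- external catalog when the table is created
> CREATE TABLE jsonTable
> USING org.apache.spark.sql.json
> OPTIONS (path '/data/json/events')
>
> -- After the underlying files change, re-infer the schema from the table
> -- location and update both the external catalog and the metadata cache
> REFRESH TABLE jsonTable
> {noformat}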
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]