[ 
https://issues.apache.org/jira/browse/SPARK-16552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378277#comment-15378277
 ] 

Apache Spark commented on SPARK-16552:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14207

> Store the Inferred Schemas into External Catalog Tables when Creating Tables
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-16552
>                 URL: https://issues.apache.org/jira/browse/SPARK-16552
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiao Li
>
> Currently, in Spark SQL, the initial creation of a table's schema falls 
> into two groups. This applies to both Hive tables and Data Source tables:
> Group A. Users specify the schema. 
> Case 1 CREATE TABLE AS SELECT: the schema is determined by the result schema 
> of the SELECT clause. For example,
> {noformat}
> CREATE TABLE tab STORED AS TEXTFILE
> AS SELECT * from input
> {noformat}
> Case 2 CREATE TABLE: users explicitly specify the schema. For example,
> {noformat}
> CREATE TABLE jsonTable (_1 string, _2 string)
> USING org.apache.spark.sql.json
> {noformat}
> Group B. Spark SQL infers the schema at runtime.
> Case 3 CREATE TABLE: users do not specify the schema, only the path to the 
> file location. For example,
> {noformat}
> CREATE TABLE jsonTable 
> USING org.apache.spark.sql.json
> OPTIONS (path '${tempDir.getCanonicalPath}')
> {noformat}
> Currently, Spark SQL does not store the inferred schema in the external 
> catalog for the cases in Group B. When users refresh the metadata cache, or 
> access the table for the first time after (re-)starting Spark, Spark SQL 
> infers the schema and stores it in the metadata cache to improve the 
> performance of subsequent metadata requests. However, this runtime schema 
> inference can cause undesirable schema changes after each restart of Spark.
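> As a hypothetical illustration of this drift (the directory path and column 
> names below are made up, not from this issue): suppose the table was created 
> over a directory of JSON files, and files with an additional column later 
> land in the same path.
> {noformat}
> -- At creation time the JSON files under /data/json contain columns a and b:
> CREATE TABLE jsonTable
> USING org.apache.spark.sql.json
> OPTIONS (path '/data/json')
> -- Inferred schema cached in memory only: (a string, b string)
>
> -- New files containing an extra column c are written to /data/json,
> -- then Spark is restarted. On the next access the schema is re-inferred
> -- as (a string, b string, c string), silently differing from the schema
> -- that earlier queries saw.
> {noformat}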
> It is desirable to store the inferred schema in the external catalog when 
> creating the table. When users intend to refresh the schema, they issue 
> `REFRESH TABLE`. Spark SQL will then infer the schema again from the 
> previously specified table location and update the schema in both the 
> external catalog and the metadata cache. 
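> Under the proposed behavior, refreshing would look like this (a sketch; the 
> table name reuses the example above):
> {noformat}
> -- Re-infer the schema from the table location stored in the external
> -- catalog, then update both the catalog entry and the metadata cache:
> REFRESH TABLE jsonTable
> {noformat}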



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
