[ https://issues.apache.org/jira/browse/SPARK-16552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378277#comment-15378277 ]
Apache Spark commented on SPARK-16552:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14207

> Store the Inferred Schemas into External Catalog Tables when Creating Tables
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-16552
>                 URL: https://issues.apache.org/jira/browse/SPARK-16552
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiao Li
>
> Currently, in Spark SQL, the initial creation of a schema falls into two groups. This applies to both Hive tables and Data Source tables:
>
> Group A. Users specify the schema.
>
> Case 1 CREATE TABLE AS SELECT: the schema is determined by the result schema of the SELECT clause. For example,
> {noformat}
> CREATE TABLE tab STORED AS TEXTFILE
> AS SELECT * FROM input
> {noformat}
> Case 2 CREATE TABLE: users explicitly specify the schema. For example,
> {noformat}
> CREATE TABLE jsonTable (_1 string, _2 string)
> USING org.apache.spark.sql.json
> {noformat}
> Group B. Spark SQL infers the schema at runtime.
>
> Case 3 CREATE TABLE: users do not specify the schema, only the path to the file location. For example,
> {noformat}
> CREATE TABLE jsonTable
> USING org.apache.spark.sql.json
> OPTIONS (path '${tempDir.getCanonicalPath}')
> {noformat}
> Currently, Spark SQL does not store the inferred schema in the external catalog for the cases in Group B. When users refresh the metadata cache, or access the table for the first time after (re-)starting Spark, Spark SQL infers the schema and stores it in the metadata cache to improve the performance of subsequent metadata requests. However, runtime schema inference can cause undesirable schema changes after each restart of Spark.
>
> It is desirable to store the inferred schema in the external catalog when creating the table.
> When users intend to refresh the schema, they issue `REFRESH TABLE`. Spark SQL will then infer the schema again based on the previously specified table location and update/refresh it in both the external catalog and the metadata cache.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
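The schema drift described above can be illustrated with a minimal, self-contained sketch. This is plain Python standing in for Spark's JSON schema inference, not Spark itself; `infer_schema` and the sample records are hypothetical simplifications (real inference also resolves field types, nesting, and conflicts):

```python
# Hypothetical stand-in for JSON schema inference: the inferred schema is
# modeled as the sorted union of top-level keys seen across all records.
def infer_schema(records):
    fields = set()
    for rec in records:
        fields.update(rec.keys())
    return sorted(fields)

# Records present in the table's file location at CREATE TABLE time.
initial_data = [{"_1": "a", "_2": "b"}]
schema_at_create = infer_schema(initial_data)
print(schema_at_create)  # ['_1', '_2']

# A later write adds a record with an extra column. Because the schema is
# not persisted in the external catalog, re-running inference after a
# Spark restart silently yields a different schema.
later_data = initial_data + [{"_1": "c", "_2": "d", "_3": "e"}]
schema_after_restart = infer_schema(later_data)
print(schema_after_restart)  # ['_1', '_2', '_3']
```

Persisting `schema_at_create` in the external catalog, and re-inferring only on an explicit `REFRESH TABLE`, keeps the table's schema stable across restarts while still letting users opt in to picking up changes.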