[https://issues.apache.org/jira/browse/SPARK-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272016#comment-14272016]
Yin Huai commented on SPARK-4912:
---------------------------------
h3. Persistence of metadata
Right now, all tables created through the data sources API are ephemeral and
share the fate of the SQLContext that creates them.
Requirements:
* Allow the creation of tables that persist across invocations of the
SQLContext and are visible to other instances.
* Caching of BaseRelation instances for performance, as some perform expensive
discovery operations upon creation (schema inference, partition discovery,
etc.); a minimal cache sketch follows this list.
* The ability to refresh a cached instance manually.
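As a rough illustration of the caching and manual-refresh requirements, here is
a minimal sketch of a per-context relation cache. The RelationCache class and
its resolve parameter are hypothetical, standing in for whatever performs the
expensive discovery work; only the shape of the lookup/refresh contract is the
point.
{code}
import scala.collection.mutable

// Hypothetical sketch: cache expensive-to-build relations per context,
// with a manual refresh hook. `resolve` stands in for the discovery work
// (schema inference, partition discovery, etc.).
class RelationCache[R](resolve: String => R) {
  private val cache = mutable.Map.empty[String, R]

  // Return the cached relation, building and caching it on first access.
  def lookup(tableName: String): R =
    cache.getOrElseUpdate(tableName, resolve(tableName))

  // Drop the cached instance so the next lookup re-runs discovery.
  def refreshTable(tableName: String): Unit =
    cache.remove(tableName)
}
{code}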
h4. Proposed Solution
When the word TEMPORARY is omitted and a HiveContext is used, Spark SQL will
create tables in the Hive Metastore. The table properties will be used to store
all properties used to create a BaseRelation. The SerDe properties will be used
to store all user-provided properties from the OPTIONS clause. When a schema is
specified, it will also be stored in the metastore. Standard Hive DDL can be
used to alter the schema or options of a table; in that case we may want to
intercept these statements and invalidate any cached BaseRelation so the
changes take effect.
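For concreteness, a persistent table under this proposal would be created with
the existing data sources DDL, simply omitting TEMPORARY; the table name and
path below are illustrative:
{code:sql}
-- TEMPORARY omitted, so the table is recorded in the Hive Metastore.
-- The provider class and the OPTIONS map are what get persisted in the
-- table and SerDe properties, respectively.
CREATE TABLE logs
USING org.apache.spark.sql.parquet
OPTIONS (path '/data/logs')
{code}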
For user-defined data types, one issue is that Hive internally validates data
types and throws an exception when a data type string cannot be recognized. To
work around this issue, we will store schemas in the table properties. This
way, we can also support data types added in the future that are not supported
by Hive.
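A minimal sketch of that round trip, assuming the schema is serialized to its
JSON form with StructType's json method and recovered with DataType.fromJson;
the property key below is illustrative:
{code}
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("tags", ArrayType(StringType))))

// Store the JSON form in the table properties instead of a Hive type
// string, so Hive never runs its own data type validation over it.
// The property key is illustrative.
val props = Map("spark.sql.sources.schema" -> schema.json)

// On load, recover the schema without involving Hive's type parser.
val restored =
  DataType.fromJson(props("spark.sql.sources.schema")).asInstanceOf[StructType]
{code}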
When a table is loaded from the metastore and the dummy input format is
detected, Spark SQL will bypass Hive's normal read path and instead invoke the
specified relation provider.
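A hedged sketch of that load path, assuming the provider class name was saved
under an illustrative table property key at creation time and that the provider
implements the existing RelationProvider interface:
{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}

// Sketch only: look up the provider class recorded in the table properties
// and hand it the OPTIONS that were stashed in the SerDe properties.
def loadRelation(
    sqlContext: SQLContext,
    tableProps: Map[String, String],
    options: Map[String, String]): BaseRelation = {
  val providerClass = tableProps("spark.sql.sources.provider")
  val provider =
    Class.forName(providerClass).newInstance().asInstanceOf[RelationProvider]
  provider.createRelation(sqlContext, options)
}
{code}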
Commands will be added to the API for refreshing cached BaseRelations. This
can be used on both persistent and temporary tables.
Programmatic API
{code}
def refreshTable(tableName: String)
{code}
In SQL
{code:sql}
REFRESH TABLE <TABLE NAME>
{code}
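Either form would invalidate the cached instance; for example (table name
illustrative):
{code}
// Drop the cached BaseRelation so the next query re-runs schema inference
// and partition discovery against the current data.
sqlContext.refreshTable("logs")
{code}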
> Persistent data source tables
> -----------------------------
>
> Key: SPARK-4912
> URL: https://issues.apache.org/jira/browse/SPARK-4912
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Michael Armbrust
> Assignee: Michael Armbrust
> Priority: Blocker
>
> It would be good if tables created through the new data sources API could be
> persisted to the Hive metastore.