[ https://issues.apache.org/jira/browse/SPARK-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272016#comment-14272016 ]

Yin Huai commented on SPARK-4912:
---------------------------------

h3. Persistence of metadata
 
Right now, all tables created through the data sources API are ephemeral: they 
live only as long as the SQLContext that creates them.

Requirements:
* Allow the creation of tables that persist across invocations of the 
SQLContext and are visible to other instances.
* Caching of BaseRelation instances for performance, as some perform expensive 
discovery operations upon creation (schema inference, partition discovery, etc.).
* The ability to refresh a cached instance manually.

h4. Proposed Solution
When the word TEMPORARY is omitted and a HiveContext is used, Spark SQL will 
create tables in the Hive Metastore. The table properties will be used to store 
all properties needed to create a BaseRelation, and the SerDe properties will be 
used to store all user-provided properties from OPTIONS. When a schema is 
specified, it will also be stored in the metastore. Standard Hive DDL can be 
used to alter the schema or options of a table; in this case we may want to 
intercept these operations and keep the stored properties and any cached 
BaseRelation in sync.
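
For illustration, the two forms could look like this (the table name, provider, 
and path below are placeholders; the TEMPORARY form is the existing data 
sources DDL, and the persistent form is what this proposal adds):
{code:sql}
-- Existing behavior: ephemeral, tied to the creating SQLContext.
CREATE TEMPORARY TABLE events
USING org.apache.spark.sql.parquet
OPTIONS (path '/data/events');

-- Proposed: omitting TEMPORARY under a HiveContext persists the provider,
-- OPTIONS, and schema in the Hive Metastore.
CREATE TABLE events
USING org.apache.spark.sql.parquet
OPTIONS (path '/data/events');
{code}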

For user-defined data types, one issue is that Hive internally validates data 
types and throws an exception when a data type string cannot be recognized. To 
work around this issue, we will store schemas in the table properties. This way, 
we can also support data types added in the future that are not supported by 
Hive.
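
As a rough sketch of that round trip (the property key below is a placeholder, 
not necessarily the real one), the schema could be stored via its JSON 
representation and parsed back on load:
{code}
import org.apache.spark.sql.types.{DataType, StructType}

object SchemaProperties {
  // Placeholder property key; the actual key is an implementation detail.
  val SchemaKey = "spark.sql.sources.schema"

  // On CREATE: serialize the schema to JSON so Hive never has to
  // validate individual Spark SQL data types.
  def toProperty(schema: StructType): (String, String) =
    SchemaKey -> schema.json

  // On load: recover the schema from the stored JSON, if present.
  def fromProperties(props: Map[String, String]): Option[StructType] =
    props.get(SchemaKey).map(DataType.fromJson(_).asInstanceOf[StructType])
}
{code}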

When a table is loaded from the metastore and the dummy input format is 
detected, Spark SQL will instead invoke the specified relation provider.
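
A minimal sketch of that load path, assuming the provider class name is stored 
under a placeholder table property key and the provider implements the data 
sources API's RelationProvider:
{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}

object MetastoreRelationResolver {
  // Placeholder key under which the provider class name would be stored.
  val ProviderKey = "spark.sql.sources.provider"

  // Called when a metastore table carries the dummy input format marker:
  // instantiate the stored provider and hand it the saved OPTIONS.
  def resolveRelation(
      sqlContext: SQLContext,
      tableProps: Map[String, String],
      options: Map[String, String]): BaseRelation = {
    val provider = Class.forName(tableProps(ProviderKey))
      .newInstance()
      .asInstanceOf[RelationProvider]
    provider.createRelation(sqlContext, options)
  }
}
{code}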

Commands will be added to the API for refreshing cached BaseRelations. These 
can be used on both persistent and temporary tables.

Programmatic API 
{code}
def refreshTable(tableName: String): Unit
{code}
In SQL
{code:sql}
REFRESH TABLE <TABLE NAME>
{code}
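
For reference, a minimal sketch of how a per-context cache could back 
refreshTable; the cache structure and names here are assumptions, not the 
actual implementation. Dropping the cached instance forces the next lookup to 
rebuild the BaseRelation and re-run its discovery work.
{code}
import scala.collection.mutable
import org.apache.spark.sql.sources.BaseRelation

// Hypothetical per-context cache keyed by table name.
class RelationCache(load: String => BaseRelation) {
  private val cache = mutable.Map.empty[String, BaseRelation]

  // Reuse a cached BaseRelation, building it on first access.
  def lookup(tableName: String): BaseRelation =
    cache.getOrElseUpdate(tableName, load(tableName))

  // refreshTable would drop the cached instance; the next lookup rebuilds
  // it, re-running any expensive discovery the provider performs.
  def refresh(tableName: String): Unit = cache.remove(tableName)
}
{code}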

> Persistent data source tables
> -----------------------------
>
>                 Key: SPARK-4912
>                 URL: https://issues.apache.org/jira/browse/SPARK-4912
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Michael Armbrust
>            Assignee: Michael Armbrust
>            Priority: Blocker
>
> It would be good if tables created through the new data sources API could be 
> persisted to the Hive metastore.


