[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

aarondav Mon, 04 Aug 2014 00:36:08 -0700

GitHub user aarondav opened a pull request:

    https://github.com/apache/spark/pull/1764


    [SPARK-2824/2825][SQL] Work towards separating data location from format

    Currently, there is a fundamental assumption in SparkSQL that a Parquet
    table is stored at a certain Hadoop path and that a Metastore table is
    stored within the Hive warehouse. However, the fact that a table is
    Parquet or serialized as an object file is independent of where the data
    is actually located.
    
    This patch attempts to work towards creating a cleaner separation between
    where the data is located and the format the data is in by introducing
    two concepts: a TableFormat and a TableLocation. This abstraction
    enables code like the following:
    
    ```scala
    val myTable = // ...
    myTable.saveAsTable("myTable", classOf[ParquetFormat])
    hql("SELECT * FROM myTable").collect // reads from Parquet!
    
    // Also allows expansion of file-writing later:
    myTable.saveAsFile("/my/file", classOf[ParquetFormat])
    ```
    
    Additionally, this allows us to trivially support external tables with
    arbitrary formats.
    
    However, this PR doesn't attempt to make any radical changes. Parquet files
    still only support being written to a single Hadoop directory, but this can
    be part of a Hive table or a normal directory. The MetastoreRelation still
    requires living within the Metastore because it relies heavily on the
    metadata there. The hope of this patch is that it enables the two linked
    features ([SPARK-2824](https://issues.apache.org/jira/browse/SPARK-2824)
    and [SPARK-2825](https://issues.apache.org/jira/browse/SPARK-2825)) while
    adding a useful abstraction for the future.
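    
    To make the format/location split concrete, here is a minimal,
    self-contained sketch of the idea. It is not the code in this PR: the
    traits below only borrow the names TableFormat and TableLocation, and
    everything else (DirectoryLocation, WarehouseLocation, TabSeparatedFormat,
    Catalog, the in-memory "storage" map standing in for a filesystem) is
    hypothetical and chosen for illustration only.
    
    ```scala
    import scala.collection.mutable
    
    // Hypothetical sketch: these are NOT the definitions introduced by the PR.
    object FormatLocationSketch {
      // Where the data lives, independent of how it is encoded.
      sealed trait TableLocation { def path: String }
      final case class DirectoryLocation(path: String) extends TableLocation
      final case class WarehouseLocation(warehouseDir: String, tableName: String)
          extends TableLocation {
        val path: String = s"$warehouseDir/$tableName"
      }
    
      // How rows are encoded, independent of where they are stored.
      trait TableFormat {
        def name: String
        def write(rows: Seq[Seq[String]]): String
        def read(data: String): Seq[Seq[String]]
      }
    
      // Toy format (assumption): tab-separated text instead of Parquet.
      object TabSeparatedFormat extends TableFormat {
        val name = "tsv"
        def write(rows: Seq[Seq[String]]): String =
          rows.map(_.mkString("\t")).mkString("\n")
        def read(data: String): Seq[Seq[String]] =
          if (data.isEmpty) Nil
          else data.split("\n").toList.map(_.split("\t", -1).toList)
      }
    
      // A table is just a pairing of one format with one location.
      final case class Table(format: TableFormat, location: TableLocation)
    
      class Catalog {
        // In-memory stand-in for a filesystem, keyed by path.
        private val storage = mutable.Map.empty[String, String]
        private val tables = mutable.Map.empty[String, Table]
    
        def save(name: String, rows: Seq[Seq[String]],
                 format: TableFormat, location: TableLocation): Unit = {
          storage(location.path) = format.write(rows)
          tables(name) = Table(format, location)
        }
    
        def load(name: String): Seq[Seq[String]] = {
          val Table(format, location) = tables(name)
          format.read(storage(location.path))
        }
      }
    }
    ```
    
    In this sketch the same format can be paired with a warehouse-managed
    location or with an arbitrary directory, which is the property that would
    make external tables with arbitrary formats straightforward.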

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aarondav/spark hive

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1764.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1764
    
----
commit b18a4ae3c9bbe8f917ccc7e9aeb1ece25d54bc46
Author: Aaron Davidson <[email protected]>
Date:   2014-08-02T19:55:41Z

    [SPARK-2824/2825][SQL] Work towards separating data location from format
    
    Currently, there is a fundamental assumption in SparkSQL that a Parquet
    table is stored at a certain Hadoop path and that a Metastore table is
    stored within the Hive warehouse. However, the fact that a table is
    Parquet or serialized as an object file is independent of where the data
    is actually located.
    
    This patch attempts to work towards creating a cleaner separation between
    where the data is located and the format the data is in by introducing
    two concepts: a TableFormat and a TableLocation. This abstraction
    enables code like the following:
    
    ```scala
    val myTable = // ...
    myTable.saveAsTable("myTable", classOf[ParquetFormat])
    hql("SELECT * FROM myTable").collect // reads from Parquet!
    
    // Also allows expansion of file-writing later:
    myTable.saveAsFile("/my/file", classOf[ParquetFormat])
    ```
    
    Additionally, this allows us to trivially support external tables
    with arbitrary formats.
    
    However, this PR doesn't attempt to make any radical changes. Parquet files
    still only support being written to a single Hadoop directory, but this can
    be part of a Hive table or a normal directory. The MetastoreRelation still
    requires living within the Metastore because it relies heavily on the
    metadata there. The hope of this patch is that it enables the two linked
    features ([SPARK-2824](https://issues.apache.org/jira/browse/SPARK-2824)
    and [SPARK-2825](https://issues.apache.org/jira/browse/SPARK-2825)) while
    adding a useful abstraction for the future.

----


