GitHub user aarondav opened a pull request:
https://github.com/apache/spark/pull/1764
[SPARK-2824/2825][SQL] Work towards separating data location from format
Currently, there is a fundamental assumption in SparkSQL that a Parquet
table is stored at a certain Hadoop path and that a Metastore table is
stored within the Hive warehouse. However, the fact that a table is
Parquet or serialized as an object file is independent of where the data
is actually located.
This patch works towards a cleaner separation between where the data is
located and the format the data is in by introducing two concepts: a
TableFormat and a TableLocation. This abstraction enables code like the
following:
```scala
val myTable = // ...
myTable.saveAsTable("myTable", classOf[ParquetFormat])
hql("SELECT * FROM myTable").collect // reads from Parquet!
// Also allows expansion of file-writing later:
myTable.saveAsFile("/my/file", classOf[ParquetFormat])
```
Additionally, this allows us to trivially support external tables with
arbitrary formats.
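To make the separation concrete, here is a minimal, self-contained sketch of how a format and a location could be modeled as independent pieces. It does not use Spark, and every name in it (`HadoopDirectory`, `MetastoreTable`, `CsvLikeFormat`, `TableDescriptor`, `describe`) is hypothetical and chosen for illustration; only the `TableFormat` and `TableLocation` names come from the description above, and their real definitions in the patch may look quite different.

```scala
// Hypothetical sketch: illustrates the idea only, not the PR's actual API.
object FormatLocationSketch {
  // Where the data lives, independent of how it is encoded.
  sealed trait TableLocation
  final case class HadoopDirectory(path: String) extends TableLocation
  final case class MetastoreTable(database: String, name: String) extends TableLocation

  // How rows are encoded, independent of where they are stored.
  trait TableFormat {
    def name: String
    def encode(rows: Seq[Seq[Any]]): Seq[String]
  }

  // A toy stand-in format (assumption: a real format such as Parquet
  // would write binary files rather than return strings).
  object CsvLikeFormat extends TableFormat {
    val name = "csv-like"
    def encode(rows: Seq[Seq[Any]]): Seq[String] = rows.map(_.mkString(","))
  }

  // A table is then just a (location, format) pair.
  final case class TableDescriptor(location: TableLocation, format: TableFormat)

  // The same format can be combined with either kind of location.
  def describe(t: TableDescriptor): String = t.location match {
    case HadoopDirectory(p)    => s"${t.format.name} data at $p"
    case MetastoreTable(db, n) => s"${t.format.name} data in metastore table $db.$n"
  }
}
```

Because the format never inspects the location, adding a new format or a new kind of location in this sketch does not require touching the other side, which is the property the abstraction is aiming for.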
However, this PR doesn't attempt to make any radical changes. Parquet files
still only support being written to a single Hadoop directory, but this can
be part of a Hive table or a normal directory. The MetastoreRelation still
requires living within the Metastore because it relies heavily on the
metadata there. The hope of this patch is that it enables the two linked
features ([SPARK-2824](https://issues.apache.org/jira/browse/SPARK-2824)
and [SPARK-2825](https://issues.apache.org/jira/browse/SPARK-2825)) while
adding a useful abstraction for the future.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aarondav/spark hive
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1764.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1764
----
commit b18a4ae3c9bbe8f917ccc7e9aeb1ece25d54bc46
Author: Aaron Davidson <[email protected]>
Date: 2014-08-02T19:55:41Z
[SPARK-2824/2825][SQL] Work towards separating data location from format
----