GitHub user marmbrus opened a pull request:
https://github.com/apache/spark/pull/2475
[WIP][SPARK-3247][SQL] An API for adding foreign data sources to Spark SQL
**Work in progress - APIs may change**
This PR introduces a new set of APIs to Spark SQL that allow other
developers to add support for reading data from new sources. As an example, a
library is included for reading data encoded using Avro.
New sources must implement the interface `BaseRelation`, which is
responsible for describing the schema of the data. This base relation must
also implement at least one `Scan` interface, which is responsible for
producing an RDD containing row objects. The various Scan interfaces allow for
optimizations such as column pruning and filter push down, when the underlying
data source can handle these operations.
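To make the relation/scan split concrete, here is a minimal, self-contained sketch of how a source might expose a schema and a column-pruned scan. Only the name `BaseRelation` comes from this description; `Field`, `Row`, `ColumnPrunedScan`, `InMemoryRelation` and the `buildScan` signature are hypothetical stand-ins, and a plain `Seq` is used in place of an RDD so the example runs without Spark. The traits in the PR itself may look different.

```scala
object ForeignSourceSketch {
  // Simplified stand-ins so the sketch runs without Spark on the classpath.
  type Row = Seq[Any]
  final case class Field(name: String, dataType: String)

  // Describes the schema of the data (name taken from the PR description).
  trait BaseRelation {
    def schema: Seq[Field]
  }

  // Hypothetical scan interface: the caller names the columns it needs,
  // so the source can skip everything else.
  trait ColumnPrunedScan extends BaseRelation {
    def buildScan(requiredColumns: Seq[String]): Seq[Row]
  }

  // Hypothetical source backed by two in-memory rows.
  final class InMemoryRelation extends ColumnPrunedScan {
    val schema: Seq[Field] = Seq(Field("title", "string"), Field("id", "int"))

    private val data: Seq[Row] = Seq(Seq("a", 1), Seq("b", 2))

    def buildScan(requiredColumns: Seq[String]): Seq[Row] = {
      // Resolve each requested column name to its position in the schema.
      val indices = requiredColumns.map { column =>
        val i = schema.indexWhere(_.name == column)
        require(i >= 0, s"Unknown column: $column")
        i
      }
      // Emit only the requested columns, in the requested order.
      data.map(row => indices.map(i => row(i)))
    }
  }
}
```

A planner that only needs `title` would call `buildScan(Seq("title"))` and never touch the `id` values, which is the point of offering more than one scan interface.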
External data sources can be accessed using either the programmatic API or
pure SQL. For example, the included Avro library could be called from the
Scala query DSL as follows:
```scala
import org.apache.spark.sql.avro._
val results = TestSQLContext
  .avroFile("../hive/src/test/resources/data/files/episodes.avro")
  .select('title)
  .collect()
```
The same can be done in pure SQL, for example from the SQL command line or
the JDBC interface:
```sql
CREATE FOREIGN TEMPORARY TABLE avroTable
USING org.apache.spark.sql.avro
OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro");
SELECT * FROM avroTable;
```
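As a rough illustration of how the `OPTIONS` key/value pairs could reach a data source, the sketch below passes them to a provider as a `Map[String, String]`. Everything here is an assumption made for illustration: `RelationProvider`, `createRelation`, `PathRelation` and `AvroLikeProvider` are hypothetical names, and the mechanism this PR actually uses to resolve the `USING` clause may differ.

```scala
object ProviderSketch {
  // Hypothetical relation that only remembers where its data lives.
  final case class PathRelation(path: String)

  // Hypothetical provider interface; the trait in the PR may differ.
  trait RelationProvider {
    def createRelation(options: Map[String, String]): PathRelation
  }

  // Hypothetical provider that requires a "path" option, as in the SQL above.
  object AvroLikeProvider extends RelationProvider {
    def createRelation(options: Map[String, String]): PathRelation = {
      val path = options.getOrElse(
        "path",
        throw new IllegalArgumentException("Option 'path' is required"))
      PathRelation(path)
    }
  }
}
```

Under this assumption, `OPTIONS (path "...")` would correspond to calling `createRelation(Map("path" -> "..."))`, and a missing `path` fails early with a clear error.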
TODO:
- [ ] Move command refactoring into separate PR
- [ ] Transition parquet and json support to new API
- [ ] Figure out how to package data sources and their dependencies
- [ ] Examples / implementation of more advanced scan types
- [ ] Support for foreign catalogs
- [ ] Introspection like `describe` for foreign tables
- [ ] More tests
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/marmbrus/spark foreign
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2475.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2475
----
commit 47d542cc0238fba04b6c4e4456393d812d559c4e
Author: Michael Armbrust <[email protected]>
Date: 2014-09-20T23:20:50Z
First draft of foreign data API
----