[GitHub] [spark] cloud-fan opened a new pull request #25651: [SPARK-28948][SQL] support data source v2 in CREATE TABLE USING

GitBox Mon, 02 Sep 2019 05:07:26 -0700

cloud-fan opened a new pull request #25651: [SPARK-28948][SQL] support data
source v2 in CREATE TABLE USING
URL: https://github.com/apache/spark/pull/25651

### What changes were proposed in this pull request?

Currently, Data Source V2 has two major use cases:
1. Users plug in a custom catalog, which is tightly coupled with its own
data. For example, users can plug in a Cassandra catalog and use Spark to
read/write Cassandra tables directly.
2. Users read/write external data as a table directly via
`DataFrameReader/Writer`.

Use case 1 is newly introduced in the master branch, and it greatly improves
the user experience when interacting with external storage systems that have
catalogs, e.g. Cassandra, JDBC, etc.

Use case 2 is the main use case of Data Source V1, which works well if the
external storage system doesn't have a catalog, e.g. Parquet files on S3.

However, use case 2 is incomplete in Data Source V2. With Data Source V1,
users can register a v1 source as a table in the built-in catalog, e.g.
`CREATE TABLE t(i INT) USING parquet`, and then read/write the registered
table, which is more convenient than going through `DataFrameReader/Writer`.
Data Source V2 does not support this well yet.

To support it, this PR updates `TableProvider#getTable` to accept additional
partitioning info. The expected behaviors are defined in
https://docs.google.com/document/d/1oaS0eIVL1WsCjr4CqIpRv6CGkS5EoMQrngn3FsY1d-Q/edit?usp=sharing
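To make the idea of passing partitioning info into `getTable` concrete, here is a minimal, self-contained sketch. It does not use Spark's real classes or signatures: `SketchTableProvider`, `SketchTable`, `FileLikeProvider`, the `path` option, the schema parameter, and the validation rule are all hypothetical stand-ins chosen for illustration. The actual expected behaviors are the ones defined in the linked design doc.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model. None of these types are Spark's real API.
public class TableProviderSketch {

    /** Stand-in for a resolved table: its columns plus its partition columns. */
    static final class SketchTable {
        final Map<String, String> columns;     // column name -> type name
        final List<String> partitionColumns;

        SketchTable(Map<String, String> columns, List<String> partitionColumns) {
            this.columns = columns;
            this.partitionColumns = partitionColumns;
        }
    }

    /** Stand-in for a provider whose getTable also receives partitioning info. */
    interface SketchTableProvider {
        SketchTable getTable(Map<String, String> options,
                             Map<String, String> userSchema,
                             List<String> userPartitioning);
    }

    /** A catalog-less, file-like source (assumed to be located by a "path" option). */
    static final class FileLikeProvider implements SketchTableProvider {
        @Override
        public SketchTable getTable(Map<String, String> options,
                                    Map<String, String> userSchema,
                                    List<String> userPartitioning) {
            if (!options.containsKey("path")) {
                throw new IllegalArgumentException("option 'path' is required");
            }
            // Assumption of this sketch: partition columns must exist in the schema.
            for (String col : userPartitioning) {
                if (!userSchema.containsKey(col)) {
                    throw new IllegalArgumentException(
                        "partition column not in schema: " + col);
                }
            }
            return new SketchTable(userSchema, userPartitioning);
        }
    }

    public static void main(String[] args) {
        // Roughly what a built-in catalog could hand over for a statement like
        // CREATE TABLE t(i INT, day STRING) USING <source> PARTITIONED BY (day).
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("i", "INT");
        schema.put("day", "STRING");

        SketchTable table = new FileLikeProvider().getTable(
            Collections.singletonMap("path", "/tmp/t"),
            schema,
            Arrays.asList("day"));

        System.out.println(table.columns.keySet()
            + " partitioned by " + table.partitionColumns);
    }
}
```

The point of the sketch is only that the source receives the user-specified partitioning at table-resolution time, so it can accept or reject it. How a real v2 source should react is left to the design doc.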

### Why are the changes needed?

To make Data Source V2 support a use case that Data Source V1 already
supports.

### Does this PR introduce any user-facing change?

Yes, it's a new feature.

### How was this patch tested?

A new test suite.


