cloud-fan opened a new pull request #24233: [SPARK-26356][SQL] remove SaveMode 
from data source v2
URL: https://github.com/apache/spark/pull/24233
 
 
   ## What changes were proposed in this pull request?
   
   In data source v1, the save mode specified in `DataFrameWriter` is passed directly to the data source implementation, and each data source can define its own behavior for each save mode. This is confusing, and we want to get rid of save mode in data source v2.
   
   For data source v2, we expect data sources to implement the `TableCatalog` API, and end-users to use SQL (or the new write API described in [this doc](https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5ace0718#heading=h.e9v1af12g5zo)) to access data sources. The SQL API has very clear semantics, and we don't need save mode at all.
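   To illustrate why SQL makes save mode unnecessary, here is a small sketch mapping each write statement to its single, unambiguous behavior. The class and field names are hypothetical; the mapping is my reading of standard Spark SQL semantics, not text from the design doc:

   ```java
   import java.util.Map;

   // Hypothetical illustration (not Spark code): each SQL write statement has
   // exactly one behavior, so there is no separate "mode" to negotiate.
   final class SqlWriteSemantics {
       static final Map<String, String> BEHAVIOR = Map.of(
           "INSERT INTO",            "append data to an existing table",
           "INSERT OVERWRITE",       "overwrite data in an existing table",
           "CREATE TABLE AS SELECT", "create a new table; fail if it already exists");
   }
   ```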
   
   However, for simple data sources that do not have table management (like a JIRA data source, a noop sink, etc.), it's not ideal to ask them to implement the `TableCatalog` API and throw exceptions here and there.
   
   The `TableProvider` API is created for simple data sources. It can only get tables, without any other table management methods, which means it can only deal with existing tables.
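   A simplified, hypothetical rendering of the two API shapes (the interface and method names below are stand-ins, not the real Spark interfaces): `TableProvider` can only look a table up, while `TableCatalog` also carries the management methods a full catalog needs.

   ```java
   import java.util.Map;

   interface SimpleTable { String name(); }

   // Get-table-only API, for simple sources with no table management.
   interface SimpleTableProvider {
       // The single capability: obtain a table for the given options.
       SimpleTable getTable(Map<String, String> options);
   }

   // Full catalog API, with lookup plus management methods.
   interface SimpleTableCatalog {
       SimpleTable loadTable(String ident);
       SimpleTable createTable(String ident, Map<String, String> props);
       boolean tableExists(String ident);
       boolean dropTable(String ident);
   }

   // Example "simple" source: a noop sink that always hands back one table.
   final class NoopProvider implements SimpleTableProvider {
       public SimpleTable getTable(Map<String, String> options) {
           return () -> "noop";
       }
   }
   ```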
   
   `TableProvider` fits well with `DataStreamReader` and `DataStreamWriter`, as they only read/write existing tables. However, `TableProvider` doesn't fit `DataFrameWriter` well, as save modes require more than just getting a table. More specifically, `ErrorIfExists` mode needs to check whether the table exists and create it if not, and `Ignore` mode needs to check whether the table exists. When end-users specify `ErrorIfExists` or `Ignore` mode and write data to a `TableProvider` via `DataFrameWriter`, Spark fails the query and asks users to use `Append` or `Overwrite` mode instead.
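   The restriction above can be sketched in plain Java (names are assumptions, not Spark's actual code): `Append` and `Overwrite` only need the existing table, while the other two modes need an existence check that a get-table-only provider cannot supply.

   ```java
   enum Mode { APPEND, OVERWRITE, ERROR_IF_EXISTS, IGNORE }

   final class ModeCheck {
       /** True iff a provider that can only get existing tables can honor the mode. */
       static boolean supportedByProvider(Mode mode) {
           switch (mode) {
               case APPEND:
               case OVERWRITE:
                   return true;   // only needs the existing table
               default:
                   return false;  // needs an exists-check (and, for
                                  // ERROR_IF_EXISTS, a create as well)
           }
       }
   }
   ```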
   
   The file source sits between `TableProvider` and `TableCatalog`: it's simple, but it can check whether a table (path) exists and create one. As a result, the file source supports all the save modes.
   
   Currently the file source implements `TableProvider`, which doesn't work because `TableProvider` doesn't support the `ErrorIfExists` and `Ignore` modes. Ideally we should create a new API for path-based data sources, but to unblock the file source v2 migration, this PR proposes to special-case file source v2 in `DataFrameWriter` to make it work.
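   A hypothetical sketch of that special case (again with assumed names, not the actual `DataFrameWriter` code): a path-based file source can check and create its "table" (the path), so all modes are allowed for it, while any other `TableProvider` stays restricted to `Append` and `Overwrite`.

   ```java
   enum WriteMode { APPEND, OVERWRITE, ERROR_IF_EXISTS, IGNORE }

   final class WriterDispatch {
       static boolean allowed(WriteMode mode, boolean isFileSourceV2) {
           if (isFileSourceV2) {
               return true;  // file source can exists-check/create its path
           }
           return mode == WriteMode.APPEND || mode == WriteMode.OVERWRITE;
       }
   }
   ```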
   
   This PR also removes `SaveMode` from data source v2, as now only the 
internal file source v2 needs it.
   
   ## How was this patch tested?
   
   (Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
