cloud-fan opened a new pull request #24233: [SPARK-26356][SQL] remove SaveMode 
from data source v2
URL: https://github.com/apache/spark/pull/24233
 
 
   ## What changes were proposed in this pull request?
   
   In data source v1, the save mode specified in `DataFrameWriter` is passed directly to the data source implementation, and each data source can define its own behavior for each save mode. This is confusing, and we want to get rid of save mode in data source v2.
   
   For data source v2, we expect data sources to implement the `TableCatalog` API, and end-users to use SQL (or the new write API described in [this doc](https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5ace0718#heading=h.e9v1af12g5zo)) to access data sources. The SQL API has very clear semantics, and we don't need save mode at all.
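   To illustrate why SQL makes save mode unnecessary, here is a small sketch mapping each write statement to its single, unambiguous behavior. The class and field names are hypothetical; the mapping is my reading of standard Spark SQL semantics, not text from the design doc:

   ```java
   import java.util.Map;

   // Hypothetical illustration (not Spark code): each SQL write statement has
   // exactly one behavior, so there is no separate "mode" to negotiate.
   final class SqlWriteSemantics {
       static final Map<String, String> BEHAVIOR = Map.of(
           "INSERT INTO",            "append data to an existing table",
           "INSERT OVERWRITE",       "overwrite data in an existing table",
           "CREATE TABLE AS SELECT", "create a new table; fail if it already exists");
   }
   ```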
   
   However, for simple data sources that do not have table management (like a JIRA data source, a noop sink, etc.), it's not ideal to ask them to implement the `TableCatalog` API and throw exceptions here and there.
   
   The `TableProvider` API is created for simple data sources. It can only get tables, without any other table management methods, which means it can only deal with existing tables.
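   A simplified, hypothetical rendering of the two API shapes (the interface and method names below are stand-ins, not the real Spark interfaces): `TableProvider` can only look a table up, while `TableCatalog` also carries the management methods a full catalog needs.

   ```java
   import java.util.Map;

   interface SimpleTable { String name(); }

   // Get-table-only API, for simple sources with no table management.
   interface SimpleTableProvider {
       // The single capability: obtain a table for the given options.
       SimpleTable getTable(Map<String, String> options);
   }

   // Full catalog API, with lookup plus management methods.
   interface SimpleTableCatalog {
       SimpleTable loadTable(String ident);
       SimpleTable createTable(String ident, Map<String, String> props);
       boolean tableExists(String ident);
       boolean dropTable(String ident);
   }

   // Example "simple" source: a noop sink that always hands back one table.
   final class NoopProvider implements SimpleTableProvider {
       public SimpleTable getTable(Map<String, String> options) {
           return () -> "noop";
       }
   }
   ```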
   
   `TableProvider` fits well with `DataStreamReader` and `DataStreamWriter`, as they only read/write existing tables. However, `TableProvider` doesn't fit `DataFrameWriter` well, as save modes require more than just getting a table. More specifically, `ErrorIfExists` mode needs to check whether the table exists and create it if not, and `Ignore` mode needs to check whether the table exists. When end-users specify `ErrorIfExists` or `Ignore` mode and write data to a `TableProvider` via `DataFrameWriter`, Spark fails the query and asks users to use `Append` or `Overwrite` mode instead.
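   The restriction above can be sketched in plain Java (names are assumptions, not Spark's actual code): `Append` and `Overwrite` only need the existing table, while the other two modes need an existence check that a get-table-only provider cannot supply.

   ```java
   enum Mode { APPEND, OVERWRITE, ERROR_IF_EXISTS, IGNORE }

   final class ModeCheck {
       /** True iff a provider that can only get existing tables can honor the mode. */
       static boolean supportedByProvider(Mode mode) {
           switch (mode) {
               case APPEND:
               case OVERWRITE:
                   return true;   // only needs the existing table
               default:
                   return false;  // needs an exists-check (and, for
                                  // ERROR_IF_EXISTS, a create as well)
           }
       }
   }
   ```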
   
   The file source sits between `TableProvider` and `TableCatalog`: it's simple, but it can check whether a table (path) exists and create one. As a result, the file source supports all the save modes.
   
   Currently the file source implements `TableProvider`, which doesn't work because `TableProvider` doesn't support the `ErrorIfExists` and `Ignore` modes. Ideally we should create a new API for path-based data sources, but to unblock the file source v2 migration, this PR proposes to special-case file source v2 in `DataFrameWriter` to make it work.
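   A hypothetical sketch of that special case (again with assumed names, not the actual `DataFrameWriter` code): a path-based file source can check and create its "table" (the path), so all modes are allowed for it, while any other `TableProvider` stays restricted to `Append` and `Overwrite`.

   ```java
   enum WriteMode { APPEND, OVERWRITE, ERROR_IF_EXISTS, IGNORE }

   final class WriterDispatch {
       static boolean allowed(WriteMode mode, boolean isFileSourceV2) {
           if (isFileSourceV2) {
               return true;  // file source can exists-check/create its path
           }
           return mode == WriteMode.APPEND || mode == WriteMode.OVERWRITE;
       }
   }
   ```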
   
   This PR also removes `SaveMode` from data source v2, as now only the 
internal file source v2 needs it.
   
   ## How was this patch tested?
   
   (Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
