GitHub user gengliangwang opened a pull request:
https://github.com/apache/spark/pull/20454
[SPARK-23202][SQL] Add new DataSourceWriter API: onDataWriterCommit
## What changes were proposed in this pull request?
Currently, the API `DataSourceV2Writer#commit(WriterCommitMessage[])`
commits a write job with a list of commit messages.
This makes sense in some scenarios, e.g. MicroBatchExecution.
However, it makes it hard to implement
`onTaskCommit(taskCommit: TaskCommitMessage)` in `FileCommitProtocol`.
In general, the driver can start processing commit messages as they
arrive (e.g. persisting them to files) instead of waiting until all of
them have been collected.
The proposal is to add a new API:
`add(WriterCommitMessage message)`: handles a commit message as soon as it
is received from a successful data writer.
This should make the DataSourceWriter API as a whole compatible with
`FileCommitProtocol`, and more flexible.
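
To illustrate the intent, here is a minimal sketch of a driver-side writer that handles each commit message as it arrives and leaves only the final step to the job-level commit. The `CommitMessage` and `Writer` interfaces below are simplified stand-ins written for this example, not the actual Spark interfaces. The hook is named `onDataWriterCommit` after the PR title (the description above calls it `add`), and `TrackingWriter` and `runJob` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class IncrementalCommitSketch {

    /** Stand-in for WriterCommitMessage; here it only carries an output path. */
    interface CommitMessage {
        String path();
    }

    /** Simplified stand-in for DataSourceWriter with the proposed per-message hook. */
    interface Writer {
        // Proposed hook: invoked on the driver each time one data writer succeeds.
        // The default is a no-op, so writers that only need commit() are unaffected.
        default void onDataWriterCommit(CommitMessage message) {}

        // Existing job-level commit: invoked once with every collected message.
        void commit(CommitMessage[] messages);
    }

    /** Hypothetical writer that records each path as soon as its message arrives. */
    static final class TrackingWriter implements Writer {
        final List<String> persisted = new ArrayList<>();
        boolean jobCommitted = false;

        @Override
        public void onDataWriterCommit(CommitMessage message) {
            // Stands in for persisting the message before the job has finished.
            persisted.add(message.path());
        }

        @Override
        public void commit(CommitMessage[] messages) {
            // Every message was already handled; only check and finalize here.
            jobCommitted = persisted.size() == messages.length;
        }
    }

    /** Simulates the driver: one message per finished task, then the job commit. */
    static TrackingWriter runJob(String... paths) {
        TrackingWriter writer = new TrackingWriter();
        CommitMessage[] all = new CommitMessage[paths.length];
        for (int i = 0; i < paths.length; i++) {
            final String path = paths[i];
            all[i] = () -> path;
            writer.onDataWriterCommit(all[i]);
        }
        writer.commit(all);
        return writer;
    }
}
```

Because the hook has a default no-op body in this sketch, an existing writer that only implements `commit` would keep working unchanged.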
A more radical approach was attempted in #20386; this one should be more
reasonable.
## How was this patch tested?
Unit test
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/gengliangwang/spark write_api
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20454.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20454
----
commit 04edec2221a252ccfbcaf9e505eaae0a0f1664ab
Author: Wang Gengliang <ltnwgl@...>
Date: 2018-01-31T08:21:18Z
new DataSourceWriter api: onDataWriterCommit
commit 89776eced1b60b1856d6157a30ad1d8be0ba0f81
Author: Wang Gengliang <ltnwgl@...>
Date: 2018-01-31T12:39:13Z
revise comments and add test case
----