GitHub user GaalDornick opened a pull request:
https://github.com/apache/spark/pull/17190
[SPARK-19478][SS] JDBC Sink [WIP]
## What changes were proposed in this pull request?
Implementation of a Sink that stores structured streaming data in a
JDBC-compliant RDBMS. It supports the Overwrite and Append modes. By
default it provides _at-least-once_ semantics, and it can be configured to
provide _exactly-once_ semantics.
To keep track of the batches that have been written to a table, the sink
creates a _log_ table named `<tablename>$_SINK_LOG`. This table has 2 columns:
the batchID and the status of the batch, which is either COMMITTED or
UNCOMMITTED. When the JDBC Sink receives a batch, it checks whether the sink
log table has an entry for that batch with status = COMMITTED. If it does, the
sink ignores the batch; otherwise it attempts the append/overwrite operation.
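The log-table check described above can be sketched as follows. This is only an illustration and not the code in this pull request: it uses Python's built-in `sqlite3` module as a stand-in for a JDBC database, and the function name `add_batch`, the single `value` column, and the log column names `batchId` and `status` are assumptions made for the example.

```python
import sqlite3

def add_batch(conn, table, batch_id, rows):
    """Write one batch unless the sink log already marks it COMMITTED."""
    # Log table named <tablename>$_SINK_LOG; column names are assumed here.
    log = f'"{table}$_SINK_LOG"'
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {log} "
        "(batchId INTEGER PRIMARY KEY, status TEXT)"
    )
    seen = conn.execute(
        f"SELECT status FROM {log} WHERE batchId = ?", (batch_id,)
    ).fetchone()
    if seen is not None and seen[0] == "COMMITTED":
        return False  # batch was already written, so the replay is ignored

    # Record the attempt, write the rows, then mark the batch COMMITTED.
    # Each step is committed separately to mimic independent writes.
    conn.execute(
        f"INSERT OR REPLACE INTO {log} VALUES (?, 'UNCOMMITTED')", (batch_id,)
    )
    conn.commit()
    conn.executemany(
        f'INSERT INTO "{table}" (value) VALUES (?)', [(r,) for r in rows]
    )
    conn.commit()
    conn.execute(
        f"UPDATE {log} SET status = 'COMMITTED' WHERE batchId = ?", (batch_id,)
    )
    conn.commit()
    return True
```

Calling `add_batch` twice with the same batch id writes the rows once. If a failure happens after some rows are written but before the status is set to COMMITTED, a retry writes those rows again, which is why this mode is only at-least-once.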
To enable _exactly-once_ semantics, the client should create a column in the
original table that stores the batchID. This column should be of LongType. The
name of the column should be passed in the options under the name
_batchIdCol_. If the JDBC Sink finds that this option is set, it uses
_exactly-once_ mode. In this mode, it sets the _batchIdCol_ column of each
record to the id of the batch that inserts or overwrites it. Also, at the
beginning of the batch, if it finds a batch with status = UNCOMMITTED, it
deletes the records in the original table that match that batchID.
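A minimal sketch of the exactly-once cleanup step, again using Python's `sqlite3` as a stand-in for a JDBC database rather than the code in this pull request. The function name `add_batch_exactly_once`, the `value` column, and the log column names are hypothetical; the sketch also assumes that rows of every batch logged as UNCOMMITTED are removed before the new write starts.

```python
import sqlite3

def add_batch_exactly_once(conn, table, batch_id_col, batch_id, rows):
    """Write one batch, first removing rows left by UNCOMMITTED batches."""
    log = f'"{table}$_SINK_LOG"'  # log column names below are assumed
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {log} "
        "(batchId INTEGER PRIMARY KEY, status TEXT)"
    )
    seen = conn.execute(
        f"SELECT status FROM {log} WHERE batchId = ?", (batch_id,)
    ).fetchone()
    if seen is not None and seen[0] == "COMMITTED":
        return False  # already written, ignore the replay

    # Remove partial output of earlier failed attempts, matched by batch id.
    conn.execute(
        f'DELETE FROM "{table}" WHERE "{batch_id_col}" IN '
        f"(SELECT batchId FROM {log} WHERE status = 'UNCOMMITTED')"
    )
    conn.execute(
        f"INSERT OR REPLACE INTO {log} VALUES (?, 'UNCOMMITTED')", (batch_id,)
    )
    conn.commit()

    # Tag every record with the id of the batch that writes it.
    conn.executemany(
        f'INSERT INTO "{table}" (value, "{batch_id_col}") VALUES (?, ?)',
        [(r, batch_id) for r in rows],
    )
    conn.commit()
    conn.execute(
        f"UPDATE {log} SET status = 'COMMITTED' WHERE batchId = ?", (batch_id,)
    )
    conn.commit()
    return True
```

Because each record carries its batch id, a retry can delete exactly the rows that a failed attempt left behind before writing the batch again, so the table ends up with one copy of each record.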
## How was this patch tested?
Implemented JDBCSinkSuite, which is modeled on the other Sink tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/GaalDornick/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17190.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17190
----
commit 28c8bebadbb7a800c94ba7321af7d144d4678e73
Author: Jayesh Lalwani <[email protected]>
Date: 2017-02-26T07:13:39Z
Implemented JDBCSink
commit f838c4974d435cc19b7589e198d152b362227959
Author: Jayesh Lalwani <[email protected]>
Date: 2017-02-26T07:15:07Z
Merge remote-tracking branch 'upstream/master'
commit 7ac0d7899c06e7f35a3253ee57fa31f30aa946a4
Author: Jayesh Lalwani <[email protected]>
Date: 2017-02-28T13:26:49Z
Formatting code
commit 12086becb1ab882738349b5bb959b4b536832f12
Author: Jayesh Lalwani <[email protected]>
Date: 2017-02-28T14:04:05Z
Merge remote-tracking branch 'upstream/master'
commit 2a43d29a329afa27f4238d61c681fa918cd84d40
Author: Jayesh Lalwani <[email protected]>
Date: 2017-03-01T13:04:26Z
Merge remote-tracking branch 'upstream/master'
commit 756ea2cb32c8a85ccb98cd84d85962e1b5d37154
Author: Jayesh Lalwani <[email protected]>
Date: 2017-03-06T14:36:23Z
Merge remote-tracking branch 'upstream/master'
commit dde8b0b15f11c4e19361e8af485c113ef1a5b422
Author: Jayesh Lalwani <[email protected]>
Date: 2017-03-07T13:35:48Z
Merge remote-tracking branch 'upstream/master'
----