GitHub user zentol opened a pull request:
https://github.com/apache/flink/pull/1640
[FLINK-3332] Add Exactly-Once Cassandra connector
Second attempt to add an Exactly-Once Cassandra Sink. I've addressed all
issues brought up in the last PR bar one: this sink only works with Tuples. A
simpler Cassandra Sink is in the works (see FLINK-3311) that will feature POJO
support, and I intend to copy that code over once it's done.
The Exactly-Once guarantee is made by saving incoming records in the
OperatorState and only committing them to Cassandra when a checkpoint
completes. Whether an operator has committed its data is recorded by a new
CheckpointCommitter object, which saves this information in an external
resource that persists across retries. Note that a job failure while the data
is being committed will cause duplicate data to be committed, but the chance
of this happening is much smaller than for a naive At-Least-Once implementation.
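
The buffering scheme described above can be sketched in plain Java, without
any Flink or Cassandra dependency. This is only an illustration of the idea,
not the code in this PR: the names CommitLog, InMemoryCommitLog and
BufferingSink are hypothetical, the "target" list stands in for Cassandra,
and the in-memory commit log stands in for the external resource used by the
CheckpointCommitter.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the external store that remembers committed checkpoints. */
interface CommitLog {
    void markCommitted(long checkpointId);
    boolean isCommitted(long checkpointId);
}

/** In-memory variant for illustration; a real one would have to survive job restarts. */
class InMemoryCommitLog implements CommitLog {
    private long lastCommitted = -1;

    public void markCommitted(long checkpointId) {
        lastCommitted = Math.max(lastCommitted, checkpointId);
    }

    public boolean isCommitted(long checkpointId) {
        return checkpointId <= lastCommitted;
    }
}

/** Buffers records per checkpoint and writes them out only once a checkpoint completes. */
class BufferingSink<T> {
    private final CommitLog log;
    private final List<T> target; // stands in for the external database
    private List<T> current = new ArrayList<>();
    private final TreeMap<Long, List<T>> pending = new TreeMap<>();

    BufferingSink(CommitLog log, List<T> target) {
        this.log = log;
        this.target = target;
    }

    /** Called for every incoming record: buffer it, do not write it yet. */
    void invoke(T record) {
        current.add(record);
    }

    /** Called when a checkpoint is taken: seal the current buffer under that id. */
    void snapshot(long checkpointId) {
        pending.put(checkpointId, current);
        current = new ArrayList<>();
    }

    /** Called when a checkpoint completes: commit it and any earlier pending ones. */
    void notifyCheckpointComplete(long checkpointId) {
        Iterator<Map.Entry<Long, List<T>>> it =
                pending.headMap(checkpointId, true).entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, List<T>> entry = it.next();
            if (!log.isCommitted(entry.getKey())) {
                // A failure between these two statements is the window in
                // which duplicates can still be written after a restart.
                target.addAll(entry.getValue());
                log.markCommitted(entry.getKey());
            }
            it.remove();
        }
    }
}
```

Because a completion notification also commits every older pending
checkpoint, a missed notification only delays the write; it does not drop it.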
The CassandraExactlyOnceSink is implemented as a custom operator to get
access to the state backend. Values are committed with single inserts using a
PreparedStatement that is supplied by the user, similar to the batch
JDBC OutputFormat.
The Exactly-Once logic is completely contained in a GenericExactlyOnceSink
class that can be used by virtually every sink, requiring no knowledge of
the checkpointing mechanism.
The CassandraExactlyOnceSink and GenericExactlyOnceSink are covered by
tests that use the OneInputStreamTaskHarness to generate a task environment.
They verify that stored data is discarded when a state is restored, that all
data is committed when a notification is missed, and of course that everything
works when nothing goes wrong.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zentol/flink 3332_cassandra
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/1640.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1640
----
commit 4b91168dd60680502bf50f00e49b94d190f2d601
Author: zentol <[email protected]>
Date: 2016-02-10T13:14:18Z
[FLINK-3332] Add Exactly-Once Cassandra connector
----