Hi Eron,
Very interesting idea to support exactly-once semantics for sinks via
Git! I would be curious about the performance of such a sink.
Since this currently works on local file systems only (it throws an
exception otherwise), I wonder how it behaves after a failure when the
"git-${subtaskIndex}" directory is not available on the node where the
task is restarted. We might lose some of the exactly-once semantics
because task deployment is not deterministic.
Nevertheless, very elegant hack!
Cheers,
Max
On Sat, Apr 23, 2016 at 12:23 AM, Eron Wright <[email protected]> wrote:
> Hello,
> On a long plane trip I had some fun writing a Flink streaming connector
> based on Git: https://github.com/EronWright/flink-git
> Not intended for real application use; flink-git is just an experiment meant
> for discussion.
> Flink's Kafka connector provides exactly-once guarantees when acting as a
> source (consumer) but not as a sink (producer), due to a limitation of Kafka.
> This limitation invites the question of how to extend Kafka (or a similar
> system) to provide exactly-once guarantees for a sink. Since Kafka is
> envisioned as a commit log, might an answer be found in commit log concepts?
> The flink-git repository explores that possibility.
> Git provides a useful conceptual framework for the investigation, since its
> concepts are familiar and it is easily programmable with jgit. The flink-git
> repository is thus an experimental connector, based on jgit, that explores
> providing exactly-once guarantees as both a source and as a sink.
> Enjoy,
> Eron Wright
>