[
https://issues.apache.org/jira/browse/FLINK-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275162#comment-15275162
]
Robert Metzger commented on FLINK-3211:
---------------------------------------
That's a good idea.
I think we need to figure out how hard it would be to implement a producer
without using the producer library.
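For a rough sense of what that involves, here is a minimal sketch of a producer built only on the AWS SDK (the stream name and partition keys are placeholders); the main work the KPL would otherwise do, batching and retrying throttled entries, is left to the caller:

    import com.amazonaws.services.kinesis.AmazonKinesisClient;
    import com.amazonaws.services.kinesis.model.PutRecordsRequest;
    import com.amazonaws.services.kinesis.model.PutRecordsRequestEntry;
    import com.amazonaws.services.kinesis.model.PutRecordsResult;

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    public class PlainSdkProducerSketch {
        public static void main(String[] args) {
            // Credentials are picked up from the default provider chain
            AmazonKinesisClient kinesis = new AmazonKinesisClient();

            // PutRecords accepts up to 500 entries per call
            List<PutRecordsRequestEntry> entries = new ArrayList<>();
            for (int i = 0; i < 10; i++) {
                entries.add(new PutRecordsRequestEntry()
                        .withPartitionKey("key-" + i)                      // placeholder key
                        .withData(ByteBuffer.wrap(("record-" + i).getBytes())));
            }

            PutRecordsResult result = kinesis.putRecords(new PutRecordsRequest()
                    .withStreamName("my-test-stream")                      // placeholder stream
                    .withRecords(entries));

            // Throttled / failed entries must be retried by the caller -- this is
            // the main piece of work the KPL would otherwise handle for us
            System.out.println("Failed records: " + result.getFailedRecordCount());
        }
    }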
> Add AWS Kinesis streaming connector
> -----------------------------------
>
> Key: FLINK-3211
> URL: https://issues.apache.org/jira/browse/FLINK-3211
> Project: Flink
> Issue Type: New Feature
> Components: Streaming Connectors
> Affects Versions: 1.1.0
> Reporter: Tzu-Li (Gordon) Tai
> Assignee: Tzu-Li (Gordon) Tai
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> AWS Kinesis is a widely adopted message queue used by AWS users, much like a
> cloud service version of Apache Kafka. Support for AWS Kinesis would be a
> great addition to Flink's existing set of streaming connectors to external
> systems and a great way to reach out to the AWS community.
> AWS supports two different ways to consume Kinesis data: with the low-level
> AWS SDK [1], or with the high-level KCL (Kinesis Client Library) [2]. The AWS
> SDK can be used to consume Kinesis data, including reading a stream beginning
> from a specific offset (or "record sequence number" in Kinesis terminology).
> On the other hand, AWS officially recommends using the KCL, which offers a
> higher level of abstraction that also comes with checkpointing and failure
> recovery, using a KCL-managed AWS DynamoDB "lease table" as the checkpoint
> state storage.
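For illustration, a minimal sketch of reading one shard starting from a specific sequence number with only the AWS SDK (the stream name, shard id, and starting sequence number are placeholders):

    import com.amazonaws.services.kinesis.AmazonKinesisClient;
    import com.amazonaws.services.kinesis.model.GetRecordsRequest;
    import com.amazonaws.services.kinesis.model.GetRecordsResult;
    import com.amazonaws.services.kinesis.model.GetShardIteratorRequest;
    import com.amazonaws.services.kinesis.model.Record;

    public class SdkShardReaderSketch {
        public static void main(String[] args) throws Exception {
            AmazonKinesisClient kinesis = new AmazonKinesisClient();

            // Position an iterator at a known sequence number within a single shard
            String shardIterator = kinesis.getShardIterator(new GetShardIteratorRequest()
                    .withStreamName("my-test-stream")                      // placeholder
                    .withShardId("shardId-000000000000")                   // placeholder
                    .withShardIteratorType("AT_SEQUENCE_NUMBER")
                    .withStartingSequenceNumber("<sequence number from a previous read>"))
                    .getShardIterator();

            // Poll the shard; every response carries the iterator for the next call
            while (shardIterator != null) {
                GetRecordsResult result = kinesis.getRecords(new GetRecordsRequest()
                        .withShardIterator(shardIterator)
                        .withLimit(100));
                for (Record record : result.getRecords()) {
                    System.out.println(record.getSequenceNumber() + " -> "
                            + new String(record.getData().array()));
                }
                shardIterator = result.getNextShardIterator();
                Thread.sleep(1000);   // respect the per-shard GetRecords rate limit
            }
        }
    }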
> However, the KCL is essentially a stream processing library that wraps all the
> partition-to-task (or "shard" in Kinesis terminology) assignment and
> checkpointing, letting the user focus only on streaming application logic.
> This means we cannot use the KCL to implement the Kinesis streaming connector
> if we are aiming for a deep integration of Flink with Kinesis that provides
> exactly-once guarantees (the KCL promises only at-least-once). Therefore, the
> AWS SDK will be the way to go for the implementation of this feature.
> With the ability to read from specific offsets, and the fact that Kinesis and
> Kafka share a lot of similarities, the basic principles of the implementation
> of Flink's Kinesis streaming connector will very much resemble the Kafka
> connector. We can basically follow the outline described in [~StephanEwen]'s
> description [3] and [~rmetzger]'s Kafka connector implementation [4]. A few
> tweaks needed due to differences between Kinesis and Kafka are described
> below:
> 1. While the Kafka connector can support reading from multiple topics, I
> currently don't think this is a good idea for Kinesis streams (a Kinesis
> stream is logically equivalent to a Kafka topic). Kinesis streams can exist
> in different AWS regions, and each Kinesis stream under the same AWS user
> account may have completely independent access settings with different
> authorization keys. Overall, a Kinesis stream feels like a much more
> consolidated resource than a Kafka topic. It would be great to hear more
> thoughts on this part.
> 2. While Kafka has brokers that can each hold multiple partitions, the only
> partitioning abstraction in AWS Kinesis is the "shard". Therefore, in contrast
> to the Kafka connector's per-broker connections, where a single connection can
> handle multiple Kafka partitions, the Kinesis connector will only need simple
> per-shard connections.
> 3. Kinesis itself does not support committing offsets back to Kinesis. If we
> want to expose an outside view of progress the way the Kafka connector does
> with Kafka / ZK, we could probably use ZK or DynamoDB, similar to how the KCL
> works. More thoughts on this part would be very helpful too (see the sketch
> after this list).
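As a point of reference for 3., here is a hypothetical outline (class and field names are made up for illustration, and the actual shard fetching is omitted) of how the connector could keep per-shard sequence numbers in Flink's own checkpointed state, in the same spirit as the Kafka connector, with ZK or DynamoDB only needed for an external view of progress:

    import org.apache.flink.streaming.api.checkpoint.Checkpointed;
    import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

    import java.util.HashMap;

    public class FlinkKinesisConsumerSketch extends RichSourceFunction<String>
            implements Checkpointed<HashMap<String, String>> {

        // shardId -> last emitted sequence number, snapshotted on every checkpoint
        private HashMap<String, String> lastSequenceNumbers = new HashMap<>();
        private volatile boolean running = true;

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            while (running) {
                // Fetch the next record of the assigned shard via the AWS SDK (omitted),
                // then emit it and advance the sequence number under the checkpoint lock
                // so that emitted records and snapshotted state stay consistent
                String shardId = "shardId-000000000000";   // placeholder
                String record = "...";                     // fetched record payload
                String sequenceNumber = "...";             // its sequence number
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(record);
                    lastSequenceNumbers.put(shardId, sequenceNumber);
                }
                Thread.sleep(1000);                        // poll interval
            }
        }

        @Override
        public void cancel() {
            running = false;
        }

        @Override
        public HashMap<String, String> snapshotState(long checkpointId, long timestamp) {
            return new HashMap<>(lastSequenceNumbers);
        }

        @Override
        public void restoreState(HashMap<String, String> state) {
            // On recovery, resume each shard with an AFTER_SEQUENCE_NUMBER iterator
            lastSequenceNumbers = state;
        }
    }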
> As for the Kinesis sink, it should be possible to use the AWS KPL (Kinesis
> Producer Library) [5]. However, for better code consistency with the proposed
> Kinesis consumer, I think it will be better to stick with the AWS SDK for the
> implementation. The implementation should be straightforward, being almost, if
> not completely, the same as the Kafka sink.
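A sketch of what a plain-SDK sink could look like, mirroring the structure of the Kafka sink (the stream name and partition key choice are placeholders; batching with putRecords and retry handling are left out):

    import com.amazonaws.services.kinesis.AmazonKinesisClient;
    import com.amazonaws.services.kinesis.model.PutRecordRequest;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    import java.nio.ByteBuffer;

    public class FlinkKinesisSinkSketch extends RichSinkFunction<String> {

        private final String streamName;
        private transient AmazonKinesisClient kinesis;

        public FlinkKinesisSinkSketch(String streamName) {
            this.streamName = streamName;
        }

        @Override
        public void open(Configuration parameters) {
            // One client per parallel sink instance, as with the per-task Kafka producer
            kinesis = new AmazonKinesisClient();
        }

        @Override
        public void invoke(String value) {
            // Synchronous single-record put; a real implementation would batch with
            // putRecords and retry throttled entries
            kinesis.putRecord(new PutRecordRequest()
                    .withStreamName(streamName)
                    .withPartitionKey(String.valueOf(value.hashCode()))    // placeholder keying
                    .withData(ByteBuffer.wrap(value.getBytes())));
        }

        @Override
        public void close() {
            if (kinesis != null) {
                kinesis.shutdown();
            }
        }
    }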
> References:
> [1] http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-sdk.html
> [2] http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-kcl.html
> [3] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Kinesis-Connector-td2872.html
> [4] http://data-artisans.com/kafka-flink-a-practical-how-to/
> [5] http://docs.aws.amazon.com//kinesis/latest/dev/developing-producers-with-kpl.html#d0e4998