[ https://issues.apache.org/jira/browse/FLINK-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tzu-Li (Gordon) Tai updated FLINK-3211:
---------------------------------------
    Description: 
AWS Kinesis is a widely adopted message queue used by AWS users, much like a 
cloud-service version of Apache Kafka. Support for AWS Kinesis would be a great 
addition to Flink's handful of streaming connectors to external systems, and 
great outreach to the AWS community.

AWS supports two different ways to consume Kinesis data: the low-level AWS SDK 
[1], or the high-level KCL (Kinesis Client Library) [2]. The AWS SDK can be 
used to consume Kinesis data, including reading a stream from a specific 
offset (or "record sequence number" in Kinesis terminology). On the other 
hand, AWS officially recommends the KCL, which offers a higher level of 
abstraction and also comes with checkpointing and failure recovery, using a 
KCL-managed AWS DynamoDB table as the checkpoint state storage.
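
For illustration, consuming a shard with the low-level SDK roughly looks like 
the following minimal sketch (AWS SDK for Java; the stream name, shard id, and 
starting sequence number are placeholder values):

{code:java}
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.GetRecordsRequest;
import com.amazonaws.services.kinesis.model.GetRecordsResult;
import com.amazonaws.services.kinesis.model.GetShardIteratorRequest;
import com.amazonaws.services.kinesis.model.Record;
import com.amazonaws.services.kinesis.model.ShardIteratorType;

public class SdkConsumeSketch {
    public static void main(String[] args) throws InterruptedException {
        // Credentials and region come from the SDK's default provider chain.
        AmazonKinesisClient kinesis = new AmazonKinesisClient();

        // An iterator positioned at a specific sequence number -- this is
        // what makes restoring a read position after recovery possible.
        GetShardIteratorRequest iteratorRequest = new GetShardIteratorRequest()
            .withStreamName("my-stream")                    // placeholder
            .withShardId("shardId-000000000000")            // placeholder
            .withShardIteratorType(ShardIteratorType.AT_SEQUENCE_NUMBER)
            .withStartingSequenceNumber("495469866831355442");  // placeholder

        String shardIterator =
            kinesis.getShardIterator(iteratorRequest).getShardIterator();

        // Poll the shard; each response carries the iterator for the next call.
        while (shardIterator != null) {
            GetRecordsResult result = kinesis.getRecords(new GetRecordsRequest()
                .withShardIterator(shardIterator)
                .withLimit(1000));
            for (Record record : result.getRecords()) {
                String data =
                    StandardCharsets.UTF_8.decode(record.getData()).toString();
                System.out.println(record.getSequenceNumber() + ": " + data);
            }
            shardIterator = result.getNextShardIterator();
            Thread.sleep(1000); // back off; GetRecords is rate-limited per shard
        }
    }
}
{code}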

However, the KCL is essentially a stream processing library that wraps all the 
partition-to-task (or "shard" in Kinesis terminology) assignment and 
checkpointing, so that the user can focus only on the streaming application 
logic. This means we cannot use the KCL to implement the Kinesis streaming 
connector if we are aiming for a deep integration of Flink with Kinesis that 
provides exactly-once guarantees (the KCL promises only at-least-once): since 
the KCL checkpoints to DynamoDB on its own schedule, its read positions cannot 
be aligned with Flink's checkpointed state. Therefore, the AWS SDK will be the 
way to go for the implementation of this feature.
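
To make the exactly-once argument concrete: with the SDK approach, the 
connector can snapshot the per-shard sequence numbers as part of Flink's own 
checkpoints. A rough, hypothetical skeleton (class name, state layout, and the 
fetch loop are illustrative only):

{code:java}
import java.util.HashMap;

import org.apache.flink.streaming.api.checkpoint.Checkpointed;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

// Hypothetical skeleton: the source snapshots the last-emitted sequence
// number per shard into Flink's checkpoints, instead of letting the KCL
// checkpoint to DynamoDB on its own schedule.
public class FlinkKinesisConsumerSketch extends RichSourceFunction<String>
        implements Checkpointed<HashMap<String, String>> {

    // shardId -> sequence number of the last record emitted downstream
    private HashMap<String, String> lastSequenceNums = new HashMap<>();
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            // Fetch records from the assigned shards via GetRecords (see the
            // SDK sketch above). Emitting a record and advancing its shard's
            // sequence number must happen atomically w.r.t. checkpoints:
            synchronized (ctx.getCheckpointLock()) {
                // ctx.collect(deserialize(record));
                // lastSequenceNums.put(shardId, record.getSequenceNumber());
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public HashMap<String, String> snapshotState(long checkpointId,
                                                 long checkpointTimestamp) {
        return new HashMap<>(lastSequenceNums);
    }

    @Override
    public void restoreState(HashMap<String, String> state) {
        lastSequenceNums = state;
    }
}
{code}

On restore, each shard's read position would be re-seeked with an 
AFTER_SEQUENCE_NUMBER shard iterator, which is exactly what the SDK supports.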

With the ability to read from specific offsets, and given that Kinesis and 
Kafka share a lot of similarities, the basic principles of the implementation 
of Flink's Kinesis streaming connector will very much resemble the Kafka 
connector. We can basically follow the outline described in [~StephanEwen]'s 
description [3] and [~rmetzger]'s Kafka connector implementation [4]. A few 
tweaks due to differences between Kinesis and Kafka are described below:

1. While the Kafka connector can support reading from multiple topics, I 
currently don't think this is a good idea for Kinesis streams (a Kinesis 
stream is logically equivalent to a Kafka topic). Kinesis streams can exist in 
different AWS regions, and each Kinesis stream under the same AWS user account 
may have completely independent access settings with different authorization 
keys. Overall, a Kinesis stream feels like a much more self-contained resource 
than a Kafka topic. It would be great to hear more thoughts on this part.

2. While Kafka has brokers that can hold multiple partitions, the only 
partitioning abstraction for AWS Kinesis is the "shard". Therefore, in 
contrast to the Kafka connector's per-broker connections, where one connection 
can handle multiple Kafka partitions, the Kinesis connector only needs simple 
per-shard connections (a sketch follows after this list).

3. Kinesis itself does not support committing offsets back to Kinesis. If we 
want to implement this feature the way the Kafka connector does with Kafka / 
ZK, we could probably use ZK or DynamoDB, the way the KCL does. More thoughts 
on this part would be very helpful too.
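
To make point 2 concrete, shard discovery boils down to a paginated 
DescribeStream call, after which the connector can simply open one reader per 
shard. Another hedged sketch with the AWS SDK for Java (the stream name and 
the shard-to-subtask assignment scheme are placeholders):

{code:java}
import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.DescribeStreamRequest;
import com.amazonaws.services.kinesis.model.DescribeStreamResult;
import com.amazonaws.services.kinesis.model.Shard;

public class ShardDiscoverySketch {

    // Lists all shards of a stream, following DescribeStream pagination.
    public static List<Shard> listShards(AmazonKinesisClient kinesis,
                                         String streamName) {
        List<Shard> shards = new ArrayList<>();
        String exclusiveStartShardId = null;
        do {
            DescribeStreamResult result = kinesis.describeStream(
                new DescribeStreamRequest()
                    .withStreamName(streamName)
                    .withExclusiveStartShardId(exclusiveStartShardId));
            shards.addAll(result.getStreamDescription().getShards());
            if (result.getStreamDescription().getHasMoreShards()
                    && !shards.isEmpty()) {
                exclusiveStartShardId = shards.get(shards.size() - 1).getShardId();
            } else {
                exclusiveStartShardId = null;
            }
        } while (exclusiveStartShardId != null);
        return shards;
    }

    public static void main(String[] args) {
        AmazonKinesisClient kinesis = new AmazonKinesisClient();
        // One simple connection per shard -- no broker-level grouping as in
        // the Kafka connector. Shards could then be assigned to parallel
        // source subtasks round-robin, e.g. shardIndex % numSubtasks.
        for (Shard shard : listShards(kinesis, "my-stream")) {  // placeholder
            System.out.println("would open a reader for " + shard.getShardId());
        }
    }
}
{code}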


As for the Kinesis sink, it should be possible to use the AWS KPL (Kinesis 
Producer Library) [5]. However, for higher code consistency with the proposed 
Kinesis consumer, I think it will be better to still go with the AWS SDK for 
the implementation. The implementation should be straightforward, being almost 
if not completely the same as the Kafka sink.
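
A minimal sketch of such an SDK-based sink (hypothetical class; the stream 
name and the partition key scheme are placeholders):

{code:java}
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.PutRecordRequest;

// Hypothetical skeleton: mirrors the Kafka sink, but writes via the SDK's
// PutRecord instead of a Kafka producer.
public class FlinkKinesisSinkSketch extends RichSinkFunction<String> {

    private final String streamName;
    private transient AmazonKinesisClient kinesis;

    public FlinkKinesisSinkSketch(String streamName) {
        this.streamName = streamName;
    }

    @Override
    public void open(Configuration parameters) {
        // Create the (non-serializable) client lazily on the task manager.
        kinesis = new AmazonKinesisClient();
    }

    @Override
    public void invoke(String value) {
        kinesis.putRecord(new PutRecordRequest()
            .withStreamName(streamName)
            .withPartitionKey(String.valueOf(value.hashCode())) // placeholder
            .withData(ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8))));
    }
}
{code}

Batching with PutRecords instead of per-record PutRecord calls would be the 
obvious first optimization, similar to how the Kafka producer batches 
internally.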

References:
[1] http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-sdk.html
[2] http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-kcl.html
[3] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Kinesis-Connector-td2872.html
[4] http://data-artisans.com/kafka-flink-a-practical-how-to/
[5] http://docs.aws.amazon.com/kinesis/latest/dev/developing-producers-with-kpl.html#d0e4998



> Add AWS Kinesis streaming connector
> -----------------------------------
>
>                 Key: FLINK-3211
>                 URL: https://issues.apache.org/jira/browse/FLINK-3211
>             Project: Flink
>          Issue Type: New Feature
>          Components: Streaming Connectors
>    Affects Versions: 1.0.0
>            Reporter: Tzu-Li (Gordon) Tai
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>


