[
https://issues.apache.org/jira/browse/BEAM-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anonymous updated BEAM-12225:
-----------------------------
Status: Triage Needed (was: Resolved)
> Replace AWS API used to list shards from DescribeStream to ListShards
> ---------------------------------------------------------------------
>
> Key: BEAM-12225
> URL: https://issues.apache.org/jira/browse/BEAM-12225
> Project: Beam
> Issue Type: Improvement
> Components: io-java-kinesis
> Affects Versions: 2.27.0
> Reporter: Rafał Ochyra
> Assignee: Rafał Ochyra
> Priority: P2
> Fix For: 2.32.0
>
> Time Spent: 3h
> Remaining Estimate: 0h
>
> We use Google Dataflow with Apache Beam to read data from AWS Kinesis
> streams. We started experiencing problems with the [GetShardIterator
> API|https://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetShardIterator.html]
> that is used to establish the initial set of active shards to read from a
> given timestamp.
> Some of the shards unexpectedly threw AmazonKinesisExceptions (with error
> code “InternalFailure”) during an attempt to get an iterator for them, which
> caused failures of our Dataflow jobs. It resulted in us looking for potential
> solutions for this problem.
> According to our knowledge, AWS lately modified the way GetShardIterator API
> acts. Instead of throwing exceptions AWS right now will return an iterator
> even for closed shards. Following description of this behavior is available
> in [the
> documentation|https://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetShardIterator.html]:
> _“If the shard is closed, GetShardIterator returns a valid iterator for the
> last sequence number of the shard. A shard can be closed as a result of using
> SplitShard or MergeShards.”_
> Because of this change Beam will try to read closed shard, find no records
> and try to read its children. If children were created a long time in the
> past the data to read from Kinesis stream might be significantly moved back
> in time and Beam will not start processing data from the expected starting
> point. This causes unnecessary delay and additional costs for processing the
> data.
> In the current implementation, validation of shards using GetShardsIterator
> is required, because [DescribeStream
> API|https://docs.aws.amazon.com/kinesis/latest/APIReference/API_DescribeStream.html],
> used for listing them, returns all shards - not only opened, but closed ones
> as well. AWS recommends to use [ListShards
> API|https://docs.aws.amazon.com/kinesis/latest/APIReference/API_ListShards.html]
> for such use cases. It eliminates not only the problem of thrown exceptions
> but reduces time required to list valid shards. Additional issue with
> DescribeStream is the low transaction limit. It is only 10 transactions per
> second per account, whereas ListShards limit is 100 transactions per second
> per data stream.
> Use of ListShards API instead of DescribeStream API:
> * reduces time required to start reading data (it should address problem
> with long-lasting pipeline creation: BEAM-9759),
> * allows for more transactions per second,
> * introduces API for listing shards recommended by AWS,
> * eliminates the problem with data being read from unexpectedly old shards
> children.
> Potential issues with ListShards API:
> * according to the
> [documentation|https://docs.aws.amazon.com/kinesis/latest/APIReference/API_ListShards.html]
> ListShards API for fine-grained IAM policy might require an update to that
> policy
--
This message was sent by Atlassian Jira
(v8.20.10#820010)