Pierre-Yves Bigourdan created CAMEL-16594:
---------------------------------------------
Summary: DynamoDB stream updates are missed when there is more
than one active shard
Key: CAMEL-16594
URL: https://issues.apache.org/jira/browse/CAMEL-16594
Project: Camel
Issue Type: Bug
Components: camel-aws
Reporter: Pierre-Yves Bigourdan
Attachments: shards.json
The current Camel ddbstream implementation seems to incorrectly apply the
concept of {{ShardIteratorType}} to the list of shards forming a DynamoDB
stream, rather than to each shard individually.
According to the [AWS
documentation|https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_streams_GetShardIterator.html#DDB-streams_GetShardIterator-request-ShardIteratorType]:
{noformat}
ShardIteratorType determines how the shard iterator is used to start reading
stream records from the shard.
{noformat}
For example, for a given shard, when {{ShardIteratorType}} is equal to
{{LATEST}}, the AWS SDK will read the most recent data in that particular
shard. However, when {{ShardIteratorType}} is equal to {{LATEST}}, Camel
additionally uses {{ShardIteratorType}} to decide which shard to consider
amongst all the available ones in the stream:
https://github.com/apache/camel/blob/6119fdc379db343030bd25b191ab88bbec34d6b6/components/camel-aws/camel-aws2-ddb/src/main/java/org/apache/camel/component/aws2/ddbstream/ShardIteratorHandler.java#L132
If my understanding is correct, shards in DynamoDB are modelled as a tree, with
the child leaf nodes being the shards that are still active, i.e. the ones
where new stream data will appear. These child shards will have a
{{StartingSequenceNumber}}, but no {{EndingSequenceNumber}}.
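To illustrate the point about active shards, here is a minimal sketch of how open shards could be picked out by their missing {{EndingSequenceNumber}}. It does not use the AWS SDK: {{ShardInfo}} and {{activeShardIds}} are hypothetical names made up for this example, and the sequence numbers are invented.

```java
import java.util.ArrayList;
import java.util.List;

public class ActiveShardFilter {

    /** Hypothetical, simplified stand-in for a shard description (not an SDK type). */
    public static final class ShardInfo {
        public final String shardId;
        public final String startingSequenceNumber;
        public final String endingSequenceNumber; // null while the shard is still open

        public ShardInfo(String shardId, String startingSequenceNumber, String endingSequenceNumber) {
            this.shardId = shardId;
            this.startingSequenceNumber = startingSequenceNumber;
            this.endingSequenceNumber = endingSequenceNumber;
        }
    }

    /** Returns the IDs of all shards that have no ending sequence number, in input order. */
    public static List<String> activeShardIds(List<ShardInfo> shards) {
        List<String> active = new ArrayList<>();
        for (ShardInfo shard : shards) {
            if (shard.endingSequenceNumber == null) {
                active.add(shard.shardId);
            }
        }
        return active;
    }

    public static void main(String[] args) {
        // Invented sequence numbers: Shard1 is closed, Shard2 is still open.
        List<ShardInfo> shards = List.of(
                new ShardInfo("Shard1", "100", "199"),
                new ShardInfo("Shard2", "200", null));
        System.out.println(activeShardIds(shards)); // prints [Shard2]
    }
}
```

The important detail is that the result is a list: nothing guarantees that only one shard is open at a time.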
The most common case is to have a single shard, or a single branch of parent
and child nodes:
{noformat}
Shard1
|
Shard2
{noformat}
In the above case, new data will be added to {{Shard2}}, and the Camel
implementation, which looks only at the last shard when {{ShardIteratorType}}
is equal to {{LATEST}}, will be correct.
However, the tree can also look like this (see related example in the attached
JSON output from the AWS CLI):
{noformat}
            Shard1
           /      \
    Shard2          Shard3
    /    \          /    \
Shard4  Shard5  Shard6  Shard7
{noformat}
In this case, Camel will only consider {{Shard7}}, even though new data may be
added to any of {{Shard4}}, {{Shard5}}, {{Shard6}} or {{Shard7}}. This leads to
updates being missed.
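The difference can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not the actual Camel code: the map of shard ID to parent shard ID, {{exampleTree}}, {{leafShardIds}} and {{lastShardId}} are all hypothetical names, and {{lastShardId}} merely mimics the "last shard only" behaviour described above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LeafShards {

    /** Builds the seven-shard tree from the diagram as shardId -> parentShardId (null for the root). */
    public static Map<String, String> exampleTree() {
        Map<String, String> parentByShard = new LinkedHashMap<>();
        parentByShard.put("Shard1", null);
        parentByShard.put("Shard2", "Shard1");
        parentByShard.put("Shard3", "Shard1");
        parentByShard.put("Shard4", "Shard2");
        parentByShard.put("Shard5", "Shard2");
        parentByShard.put("Shard6", "Shard3");
        parentByShard.put("Shard7", "Shard3");
        return parentByShard;
    }

    /** Leaf shards are those that never appear as the parent of another shard. */
    public static List<String> leafShardIds(Map<String, String> parentByShard) {
        Set<String> parents = new HashSet<>(parentByShard.values());
        List<String> leaves = new ArrayList<>();
        for (String shardId : parentByShard.keySet()) {
            if (!parents.contains(shardId)) {
                leaves.add(shardId);
            }
        }
        return leaves;
    }

    /** Simplified imitation of picking only the last shard in the list. */
    public static String lastShardId(Map<String, String> parentByShard) {
        String last = null;
        for (String shardId : parentByShard.keySet()) {
            last = shardId;
        }
        return last;
    }

    public static void main(String[] args) {
        Map<String, String> tree = exampleTree();
        System.out.println(leafShardIds(tree)); // prints [Shard4, Shard5, Shard6, Shard7]
        System.out.println(lastShardId(tree));  // prints Shard7
    }
}
```

With the seven-shard tree, the "last shard" approach returns one shard out of four leaves, so records written to the other three would never be read.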
As far as I can tell, DynamoDB will split a stream into multiple shards
depending on the number of table partitions, which grows either when a table
holds huge amounts of data, or when an existing table with provisioned capacity
is migrated to on-demand provisioning.