[ 
https://issues.apache.org/jira/browse/CAMEL-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrea Cosentino resolved CAMEL-16594.
--------------------------------------
    Resolution: Fixed

> DynamoDB stream updates are missed when there are more than one active shards
> -----------------------------------------------------------------------------
>
>                 Key: CAMEL-16594
>                 URL: https://issues.apache.org/jira/browse/CAMEL-16594
>             Project: Camel
>          Issue Type: Bug
>          Components: camel-aws
>            Reporter: Pierre-Yves Bigourdan
>            Assignee: Andrea Cosentino
>            Priority: Major
>             Fix For: 3.12.0
>
>         Attachments: shards.json
>
>
> The current Camel ddbstream implementation seems to incorrectly apply the 
> concept of {{ShardIteratorType}} to the list of shards forming a DynamoDB 
> stream rather than each shard individually.
> According to the [AWS 
> documentation|https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_streams_GetShardIterator.html#DDB-streams_GetShardIterator-request-ShardIteratorType]:
> {noformat}
> ShardIteratorType determines how the shard iterator is used to start reading 
> stream records from the shard.
> {noformat}
> For example, for a given shard, when {{ShardIteratorType}} equal to 
> {{LATEST}}, the AWS SDK will read the most recent data in that particular 
> shard. However, when {{ShardIteratorType}} equal to {{LATEST}}, Camel will 
> additionally use {{ShardIteratorType}} to determine which shard it considers 
> amongst all the available ones in the stream: 
> https://github.com/apache/camel/blob/6119fdc379db343030bd25b191ab88bbec34d6b6/components/camel-aws/camel-aws2-ddb/src/main/java/org/apache/camel/component/aws2/ddbstream/ShardIteratorHandler.java#L132
> If my understanding is correct, shards in DynamoDB are modelled as a tree, 
> with the child leaf nodes being the shards that are still active, i.e. the 
> ones where new stream data will appear. These child shards will have a 
> {{StartingSequenceNumber}}, but no {{EndingSequenceNumber}}.
> The most common case is to have a single shard, or a single branch of parent 
> and child nodes:
> {noformat}
> Shard0
>    |
> Shard1
> {noformat}
> In the above case, new data will be added to {{Shard1}}, and the Camel 
> implementation which  looks only at the last shard when {{ShardIteratorType}} 
> is equal to {{LATEST}}, will be correct.
> However, the tree can also look like this (see related example in the 
> attached JSON output from the AWS CLI, where the shard number matches the 
> index in the JSON list):
> {noformat}
>              Shard0
>             /      \
>      Shard1          Shard2
>     /      \        /      \ 
> Shard3   Shard4  Shard5   Shard6
> {noformat}
> In this case, Camel will only consider Shard6, even though new data may be 
> added to any of Shard3, Shard4, Shard5 or Shard6. This leads to updates being 
> missed.
> As far as I can tell, DynamoDB will split into multiple shards depending on 
> the number of table partitions, which will either grow for a table with huge 
> amounts of data, or when an exiting table with provisioned capacity is 
> migrated to on-demand provisioning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to