[ 
https://issues.apache.org/jira/browse/CAMEL-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340858#comment-17340858
 ] 

Pierre-Yves Bigourdan commented on CAMEL-16594:
-----------------------------------------------

Yes, I did have a look at the code: Camel will probably need to get updates 
from multiple shards. Amongst other things, the fix would involve turning 
single fields or return values into lists or maps. One additional challenge is 
that the existing ddbstream implementation does not have many unit tests.
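
To illustrate the "single field to map" idea, here is a minimal sketch, not the actual Camel code. {{ShardIteratorTracker}} and its methods are hypothetical names. The only AWS behaviour assumed is that {{GetRecords}} returns a null {{NextShardIterator}} once a closed shard has been fully read.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Camel: keeps one iterator per shard
// instead of a single iterator field for the whole stream.
final class ShardIteratorTracker {
    private final Map<String, String> iteratorByShardId = new LinkedHashMap<>();

    // Record the iterator to use on the next poll of the given shard.
    // A null iterator means the shard is exhausted, so it is dropped.
    void update(String shardId, String nextIterator) {
        if (nextIterator == null) {
            iteratorByShardId.remove(shardId);
        } else {
            iteratorByShardId.put(shardId, nextIterator);
        }
    }

    // Copy of the shards that still need polling, in insertion order.
    Map<String, String> activeIterators() {
        return new LinkedHashMap<>(iteratorByShardId);
    }
}
```

A consumer would then loop over {{activeIterators()}} on each poll and call {{update}} with whatever iterator the read of that shard returned.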

This issue is not business critical for my company's use case, but I've done 
the digging and analysis as part of "10% time" ([this kind of 
initiative|https://en.wikipedia.org/wiki/20%25_Project] if you're not familiar 
with it). No guarantees, but I'm happy to carry on and attempt a pull request. 
As I can only dedicate a trickle of effort to this, keep in mind that 
it will take some time to land. :)

> DynamoDB stream updates are missed when there are more than one active shards
> -----------------------------------------------------------------------------
>
>                 Key: CAMEL-16594
>                 URL: https://issues.apache.org/jira/browse/CAMEL-16594
>             Project: Camel
>          Issue Type: Bug
>          Components: camel-aws
>            Reporter: Pierre-Yves Bigourdan
>            Assignee: Andrea Cosentino
>            Priority: Major
>         Attachments: shards.json
>
>
> The current Camel ddbstream implementation seems to incorrectly apply the 
> concept of {{ShardIteratorType}} to the list of shards forming a DynamoDB 
> stream rather than each shard individually.
> According to the [AWS 
> documentation|https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_streams_GetShardIterator.html#DDB-streams_GetShardIterator-request-ShardIteratorType]:
> {noformat}
> ShardIteratorType determines how the shard iterator is used to start reading 
> stream records from the shard.
> {noformat}
> For example, for a given shard, when {{ShardIteratorType}} is equal to 
> {{LATEST}}, the AWS SDK will read the most recent data in that particular 
> shard. However, when {{ShardIteratorType}} is equal to {{LATEST}}, Camel will 
> additionally use {{ShardIteratorType}} to determine which shard it considers 
> amongst all the available ones in the stream: 
> https://github.com/apache/camel/blob/6119fdc379db343030bd25b191ab88bbec34d6b6/components/camel-aws/camel-aws2-ddb/src/main/java/org/apache/camel/component/aws2/ddbstream/ShardIteratorHandler.java#L132
> If my understanding is correct, shards in DynamoDB are modelled as a tree, 
> with the child leaf nodes being the shards that are still active, i.e. the 
> ones where new stream data will appear. These child shards will have a 
> {{StartingSequenceNumber}}, but no {{EndingSequenceNumber}}.
> The most common case is to have a single shard, or a single branch of parent 
> and child nodes:
> {noformat}
> Shard1
>    |
> Shard2
> {noformat}
> In the above case, new data will be added to {{Shard2}}, and the Camel 
> implementation, which looks only at the last shard when {{ShardIteratorType}} 
> is equal to {{LATEST}}, will be correct.
> However, the tree can also look like this (see related example in the 
> attached JSON output from the AWS CLI):
> {noformat}
>              Shard1
>             /      \
>      Shard2          Shard3
>     /      \        /      \ 
> Shard4   Shard5  Shard6   Shard7
> {noformat}
> In this case, Camel will only consider Shard7, even though new data may be 
> added to any of Shard4, Shard5, Shard6 or Shard7. This leads to updates being 
> missed.
> As far as I can tell, DynamoDB will split into multiple shards depending on 
> the number of table partitions, which grows either when a table holds huge 
> amounts of data or when an existing table with provisioned capacity is 
> migrated to on-demand provisioning.
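
The leaf-shard reasoning in the description can be sketched as follows. This is a minimal illustration under stated assumptions, not the Camel implementation: {{ShardInfo}} and {{LeafShardSketch}} are hypothetical stand-ins rather than AWS SDK classes. The sketch assumes the full shard list has already been fetched (for example via {{DescribeStream}}) and treats any shard that no other shard names as its parent as a leaf.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a shard description; not the AWS SDK class.
final class ShardInfo {
    final String shardId;
    final String parentShardId; // null for the root shard

    ShardInfo(String shardId, String parentShardId) {
        this.shardId = shardId;
        this.parentShardId = parentShardId;
    }
}

final class LeafShardSketch {
    // Returns every shard that is not the parent of another shard,
    // preserving the order of the input list.
    static List<ShardInfo> findLeafShards(List<ShardInfo> shards) {
        Set<String> parentIds = new HashSet<>();
        for (ShardInfo shard : shards) {
            if (shard.parentShardId != null) {
                parentIds.add(shard.parentShardId);
            }
        }
        List<ShardInfo> leaves = new ArrayList<>();
        for (ShardInfo shard : shards) {
            if (!parentIds.contains(shard.shardId)) {
                leaves.add(shard);
            }
        }
        return leaves;
    }
}
```

For the seven-shard tree in the description, this yields Shard4 to Shard7 rather than only Shard7. For the single branch, it yields Shard2.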


