[ 
https://issues.apache.org/jira/browse/CAMEL-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre-Yves Bigourdan updated CAMEL-16594:
------------------------------------------
    Description: 
The current Camel ddbstream implementation seems to incorrectly apply the 
concept of {{ShardIteratorType}} to the list of shards forming a DynamoDB 
stream, rather than to each shard individually.

According to the [AWS 
documentation|https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_streams_GetShardIterator.html#DDB-streams_GetShardIterator-request-ShardIteratorType]:
{noformat}
ShardIteratorType determines how the shard iterator is used to start reading 
stream records from the shard.
{noformat}

For example, for a given shard, when {{ShardIteratorType}} is equal to 
{{LATEST}}, the AWS SDK will read the most recent data in that particular 
shard. However, when {{ShardIteratorType}} is equal to {{LATEST}}, Camel 
additionally uses {{ShardIteratorType}} to determine which shard it considers 
amongst all the available ones in the stream: 
https://github.com/apache/camel/blob/6119fdc379db343030bd25b191ab88bbec34d6b6/components/camel-aws/camel-aws2-ddb/src/main/java/org/apache/camel/component/aws2/ddbstream/ShardIteratorHandler.java#L132

If my understanding is correct, the shards of a DynamoDB stream are modelled 
as a tree, with the leaf nodes being the shards that are still active, i.e. 
the ones where new stream data will appear. These leaf shards have a 
{{StartingSequenceNumber}}, but no {{EndingSequenceNumber}}.
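A minimal sketch of that distinction, using a hypothetical {{Shard}} record whose field names mirror the DynamoDB Streams API (this is illustrative plain Java, not the AWS SDK model class):

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for the Streams API's Shard structure.
record Shard(String shardId, String parentShardId,
             String startingSequenceNumber, String endingSequenceNumber) {

    // A shard is still open (active) exactly when it has no
    // EndingSequenceNumber; new stream records can only land in open shards.
    boolean isOpen() {
        return endingSequenceNumber == null;
    }
}

class ActiveShards {
    // Every open shard is a candidate for new records, so each one would
    // need its own LATEST iterator, not just the last shard in the list.
    static List<Shard> activeShards(List<Shard> shards) {
        return shards.stream().filter(Shard::isOpen).collect(Collectors.toList());
    }
}
```

In other words, the set of shards that a {{LATEST}} iterator should cover is derived per shard from {{EndingSequenceNumber}}, not from a shard's position in the list.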

The most common case is to have a single shard, or a single branch of parent 
and child nodes:
{noformat}
Shard0
   |
Shard1
{noformat}

In the above case, new data will be added to {{Shard1}}, and the Camel 
implementation, which looks only at the last shard when {{ShardIteratorType}} 
is equal to {{LATEST}}, behaves correctly.

However, the tree can also look like this (see related example in the attached 
JSON output from the AWS CLI):
{noformat}
             Shard0
            /      \
     Shard1          Shard2
    /      \        /      \ 
Shard3   Shard4  Shard5   Shard6
{noformat}
In this case, Camel will only consider {{Shard6}}, even though new data may be 
added to any of {{Shard3}}, {{Shard4}}, {{Shard5}} or {{Shard6}}. This leads 
to updates being missed.
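To make the gap concrete, here is an illustrative sketch (plain Java, not the Camel or AWS SDK API) that derives the active shards of the example tree as the shards never referenced as a parent. Taking only the last element of the list would keep just one of the four:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Collectors;

class ShardTree {
    // shardId -> parentShardId (null for the root), mirroring the example tree.
    static final Map<String, String> PARENTS = new LinkedHashMap<>();
    static {
        PARENTS.put("Shard0", null);
        PARENTS.put("Shard1", "Shard0");
        PARENTS.put("Shard2", "Shard0");
        PARENTS.put("Shard3", "Shard1");
        PARENTS.put("Shard4", "Shard1");
        PARENTS.put("Shard5", "Shard2");
        PARENTS.put("Shard6", "Shard2");
    }

    // Leaves are shards that never appear as anyone's parent: these are the
    // shards where new stream records can still arrive.
    static Set<String> leaves(Map<String, String> parents) {
        Set<String> result = new LinkedHashSet<>(parents.keySet());
        result.removeAll(parents.values().stream()
                .filter(Objects::nonNull)
                .collect(Collectors.toSet()));
        return result;
    }
}
```

Running this against the example tree yields four leaves ({{Shard3}} through {{Shard6}}), each of which would need its own {{LATEST}} iterator for no updates to be missed.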

As far as I can tell, DynamoDB will split a stream into multiple shards 
depending on the number of table partitions, which grows either for a table 
with huge amounts of data, or when an existing table with provisioned capacity 
is migrated to on-demand provisioning.



> DynamoDB stream updates are missed when there is more than one active shard
> -----------------------------------------------------------------------------
>
>                 Key: CAMEL-16594
>                 URL: https://issues.apache.org/jira/browse/CAMEL-16594
>             Project: Camel
>          Issue Type: Bug
>          Components: camel-aws
>            Reporter: Pierre-Yves Bigourdan
>            Assignee: Andrea Cosentino
>            Priority: Major
>         Attachments: shards.json
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
