[GitHub] [beam] dennisylyung commented on pull request #12583: [BEAM-10706] Fix duplicate key error in DynamoDBIO.Write

GitBox Tue, 08 Dec 2020 17:44:51 -0800


dennisylyung commented on pull request #12583:
URL: https://github.com/apache/beam/pull/12583#issuecomment-741419939



   In the current implementation, the key in
`private List<KV<String, WriteRequest>> batch` is the table name, not the
primary key.
   
   For example, in a table `user` whose primary key is `id`, a batch entry
would look like this:
   `KV("user", {id=1, name=Chris, age=30})`
   Nothing in this entry tells us that `id` is the attribute to deduplicate on
unless the user specifies it.
   
   In theory, writing to DynamoDB should not require setting keys for
de-duplication, since repeated writes to the same key just update the value.
However, the current implementation of the DynamoDB batch put API does not
allow duplicate keys within a batch. Hence, users need to explicitly set the
overwrite keys.
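
   To illustrate the idea, here is a minimal sketch of de-duplicating a batch
on user-supplied overwrite keys, keeping the last write for each key. It is not
the code in this PR: `BatchDedup`, `dedupe`, and the use of plain
`Map<String, String>` items in place of `WriteRequest` are assumptions made so
the example is self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only. Items are modelled as plain maps
// rather than DynamoDB WriteRequest objects.
class BatchDedup {

  /** Keeps only the last item for each distinct combination of the overwrite keys. */
  static List<Map<String, String>> dedupe(
      List<Map<String, String>> items, List<String> overwriteKeys) {
    Map<List<String>, Map<String, String>> latest = new LinkedHashMap<>();
    for (Map<String, String> item : items) {
      // Build the composite key from the attributes the user declared as keys.
      List<String> keyValues = new ArrayList<>();
      for (String key : overwriteKeys) {
        keyValues.add(item.get(key));
      }
      // Remove before put so the surviving item sits at the position of the last write.
      latest.remove(keyValues);
      latest.put(keyValues, item);
    }
    return new ArrayList<>(latest.values());
  }

  public static void main(String[] args) {
    List<Map<String, String>> batch = List.of(
        Map.of("id", "1", "name", "Chris"),
        Map.of("id", "2", "name", "Ann"),
        Map.of("id", "1", "name", "Christopher"));
    // Two items remain: id=2, then the later of the two id=1 writes.
    System.out.println(dedupe(batch, List.of("id")).size());
  }
}
```

   Without the `overwriteKeys` argument there is nothing to build the composite
key from, which is why the keys have to come from the user.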
   
   You are right that the overwrite keys are necessary to completely avoid
`ValidationError`. As long as the sink operates with upsert logic (i.e. the
data could contain duplicate keys), there is a risk of the same key appearing
more than once in a single batch. This is also the problem I faced when
developing pipelines with DynamoDB sinks.
   
   There is one special case, though. If users are certain that their keys
will never contain duplicates, such as when their pipelines are logically
append-only, they will not encounter `ValidationError`. In that case,
requiring them to specify the keys could be unnecessary.
   


