dennisylyung commented on pull request #12583:
URL: https://github.com/apache/beam/pull/12583#issuecomment-741419939
In the current implementation, `private List<KV<String, WriteRequest>> batch`,
the key is the table name, not the primary key.
For example, in a table `user`, the primary key is `id`. A batch entry would
look like this:
`KV("user", {id=1, name=Chris, age=30})`
We have no way to know that `id` is the key we need to deduplicate on
unless the user specifies it.
In theory, writing to DynamoDB should not require setting keys for
deduplication, since repeated writes to the same key just update the value.
However, the DynamoDB batch put API currently rejects a batch that contains
duplicate keys. Hence, users need to explicitly set the overwrite keys.
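
To illustrate the idea, here is a minimal sketch of deduplicating a batch on
user-specified overwrite keys, keeping only the last write per key. It is not
the code in this PR: the class `BatchDedupSketch` and the method `dedupe` are
hypothetical names, items are modeled as plain `Map<String, String>` instead
of `WriteRequest`, the sample rows are made up, and treating an empty key list
as "no deduplication" is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the DynamoDBIO implementation.
class BatchDedupSketch {

  /**
   * Keeps only the last item for each distinct combination of overwrite-key
   * values. Items are plain maps here, standing in for DynamoDB write requests.
   */
  public static List<Map<String, String>> dedupe(
      List<Map<String, String>> items, List<String> overwriteKeys) {
    if (overwriteKeys.isEmpty()) {
      // Assumption: no keys configured means the batch is passed through as-is.
      return new ArrayList<>(items);
    }
    Map<List<String>, Map<String, String>> latest = new LinkedHashMap<>();
    for (Map<String, String> item : items) {
      // Build the composite key from the values of the overwrite-key attributes.
      List<String> keyValues = new ArrayList<>();
      for (String key : overwriteKeys) {
        keyValues.add(item.get(key));
      }
      // Remove before put so the surviving entry takes the position of the last write.
      latest.remove(keyValues);
      latest.put(keyValues, item);
    }
    return new ArrayList<>(latest.values());
  }

  public static void main(String[] args) {
    List<Map<String, String>> batch = new ArrayList<>();
    batch.add(Map.of("id", "1", "name", "Chris", "age", "30"));
    batch.add(Map.of("id", "2", "name", "Dana", "age", "41"));
    batch.add(Map.of("id", "1", "name", "Chris", "age", "31"));

    // Two distinct ids remain; the later write for id=1 wins.
    List<Map<String, String>> deduped = dedupe(batch, List.of("id"));
    System.out.println(deduped.size()); // prints 2
  }
}
```

Without the key names (`id` here) there is nothing to build the composite key
from, which is why the sink cannot do this on its own.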
You are right that the overwrite keys are necessary to completely avoid
`ValidationError`. As long as the sink operates with upsert logic (i.e. the
data could contain duplicate keys), there is a risk of the same key going into
a single batch more than once. This is also the problem I faced when
developing pipelines with DynamoDB sinks.
There is one special case, though. If the user is certain that the keys will
never contain duplicates, such as when their pipeline is logically
append-only, they will not encounter `ValidationError`. In that case,
requiring them to specify the keys could be unnecessary.