kvudata commented on issue #27156:
URL: https://github.com/apache/beam/issues/27156#issuecomment-1642724845

   Yes, we are using TFRecordIO (specifically `beam.io.WriteToTFRecord`).
   
   > it sounds like your process and/or finish_bundle methods may be not 
idempotent. Idempotency is an important consideration when writing IO code.
   
   Note that we're only logging to Stackdriver aka Google Cloud Logging in our 
DoFn (we gather logs in `process()` and then log the batch in 
`finish_bundle()`), and we're ok with duplicate logs being generated. We're 
observing that Dataflow seems to perform some kind of retry if our 
`finish_bundle()` fails, and the later `WriteToTFRecord` (which doesn't depend 
on this DoFn in our pipeline) ends up writing duplicates.
   
   Another interesting observation is that the metrics for the number of items 
processed / written in the Dataflow UI is the number of elements we would 
expect if duplicates were not being written - the duplicates are only apparent 
from inspecting the output tfrecord(s).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to