After a failure, a DoFn instance will not be reused. Either the inputs will be retried or the pipeline will fail; on a retry, a newly deserialized instance of the DoFn reprocesses the inputs. That should produce the same result, so there is no data loss.
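
To make the retry behaviour concrete, here is a rough sketch in plain Java. It is a simulation only and does not use the Beam SDK: `BufferingFn`, `FailurePoint`, and `run` are hypothetical names, the "commit" is modelled as a boolean, and the runner loop is my assumption about how a retry could look, not how any particular runner implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BundleRetryDemo {

    /** Where the simulated failure hits on the first attempt (hypothetical). */
    public enum FailurePoint { NONE, MID_BUNDLE, AFTER_FINISH_BEFORE_COMMIT }

    /** Hypothetical stand-in for a DoFn that buffers elements until finishBundle(). */
    static class BufferingFn {
        private final List<String> buffer = new ArrayList<>();
        private final List<String> sink;

        BufferingFn(List<String> sink) {
            this.sink = sink;
        }

        void processElement(String element) {
            // Held in memory only; lost if this instance is dropped.
            buffer.add(element);
        }

        void finishBundle() {
            // The externally visible side effect happens here.
            sink.addAll(buffer);
            buffer.clear();
        }
    }

    /** Simulated runner: retries the whole bundle with a fresh instance after a failure. */
    public static List<String> run(List<String> bundle, FailurePoint failure) {
        List<String> sink = new ArrayList<>();
        boolean firstAttempt = true;
        boolean committed = false;
        while (!committed) {
            // A new instance per attempt: the failed one is never reused.
            BufferingFn fn = new BufferingFn(sink);
            try {
                for (int i = 0; i < bundle.size(); i++) {
                    if (firstAttempt && failure == FailurePoint.MID_BUNDLE && i == 1) {
                        throw new IllegalStateException("simulated failure mid-bundle");
                    }
                    fn.processElement(bundle.get(i));
                }
                fn.finishBundle();
                if (firstAttempt && failure == FailurePoint.AFTER_FINISH_BEFORE_COMMIT) {
                    throw new IllegalStateException("simulated failure before commit");
                }
                committed = true; // consumption of the bundle is now committed
            } catch (IllegalStateException e) {
                // The failed instance and its buffer are discarded; retry the bundle.
                firstAttempt = false;
            }
        }
        return sink;
    }

    public static void main(String[] args) {
        List<String> bundle = Arrays.asList("a", "b", "c");
        System.out.println(run(bundle, FailurePoint.MID_BUNDLE));                 // [a, b, c]
        System.out.println(run(bundle, FailurePoint.AFTER_FINISH_BEFORE_COMMIT)); // [a, b, c, a, b, c]
    }
}
```

In the mid-bundle case the buffered element is lost with the old instance, but the retry reprocesses the whole bundle, so the sink ends up with each element once. In the second case the side effects land twice, which matches the "duplicated calls to processElement() and finishBundle()" scenario quoted below.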

On Wed, Jun 8, 2016 at 10:29 AM, Raghu Angadi <rang...@google.com.invalid> wrote:

> On Wed, Jun 8, 2016 at 10:13 AM, Ben Chambers <bchamb...@google.com.invalid>
> wrote:
>
> > - If failure occurs after finishBundle() but before the consumption is
> > committed, then the bundle may be reprocessed, which leads to duplicated
> > calls to processElement() and finishBundle().
> >
> > - If failure occurs after consumption is committed but before
> > finishBundle(), then those elements which may have buffered state in the
> > DoFn but not had their side-effects fully processed (since the
> > finishBundle() was responsible for that) are lost.
> >
>
> I am trying to understand this better. Does this mean during
> recovery/replay after a failure, the particular instance of DoFn that
> existed before the worker failure would not be discarded, but might still
> receive elements? If a DoFn is caching some internal state, it should
> always assume the worker its on might abruptly fail anytime and the state
> would be lost, right?
>