In the case of failure, a DoFn instance will not be reused; however, in the
case of failure either the inputs will be retried, or the pipeline will
fail, allowing a newly deserialized instance of the DoFn to reprocess the
inputs (which should produce the same result, meaning there is no data
loss).

On Wed, Jun 8, 2016 at 10:29 AM, Raghu Angadi <rang...@google.com.invalid>
wrote:

> On Wed, Jun 8, 2016 at 10:13 AM, Ben Chambers <bchamb...@google.com.invalid
> >
> wrote:
>
> > - If failure occurs after finishBundle() but before the consumption is
> > committed, then the bundle may be reprocessed, which leads to duplicated
> > calls to processElement() and finishBundle().
> >
>
>
>
> > - If failure occurs after consumption is committed but before
> > finishBundle(), then those elements which may have buffered state in the
> > DoFn but not had their side-effects fully processed (since the
> > finishBundle() was responsible for that) are lost.
> >
>
> I am trying to understand this better. Does this mean during
> recovery/replay after a failure, the particular instance of DoFn that
> existed before the worker failure would not be discarded, but might still
> receive elements?  If a DoFn is caching some internal state, it should
> always assume the worker its on might abruptly fail anytime and the state
> would be lost, right?
>

Reply via email to