Yes and no.

Window 30 could be committed and we could restore to the corresponding
checkpoint. The corresponding checkpoint for window 30 is taken after
window 30 is done, but before window 31 has begun. So when we restore to
the checkpoint we will not redo window 30, we will start at window 31. Once
a window id is committed it will never be redone by any operator.

On Tue, Dec 15, 2015 at 2:04 PM, Ashwin Chandra Putta <
[email protected]> wrote:

> Siyuan,
>
> Yes, we are discussing at least once semantics.
>
> Tim,
>
> So it is indeed possible that we recover at committed window id in a case
> where we just committed and there were no further checkpoints before
> failure.
>
> Regards,
> Ashwin.
>
> On Tue, Dec 15, 2015 at 1:54 PM, Timothy Farkas <[email protected]>
> wrote:
>
> > Whoops my bad, that would never happen. There is a check that only allows
> > purging of checkpoints for an operator if the operator has more than one
> > checkpoint. :)
> >
> > On Tue, Dec 15, 2015 at 1:39 PM, Timothy Farkas <[email protected]>
> > wrote:
> >
> > > Siyuan, then Ashwin may be right that there is an issue. Looking at the
> > > code again I think this could happen:
> > >
> > > 1 - All operators reach checkpiont 30
> > > 2 - Checkpoints are updated on heartbeat and committed window is now
> 25,
> > > everything before window 30 is purged
> > > 3 - no new checkpoint is reached for any operator
> > > 4 - Checkpoints are updated on heartbeat again and committed window is
> > now
> > > 30, now window 30 is purged.
> > >
> > > May be missing something again though.
> > >
> > > On Tue, Dec 15, 2015 at 1:32 PM, Siyuan Hua <[email protected]>
> > > wrote:
> > >
> > >> My understanding is the committed window could possibly be 30 as well,
> > >> depends on whether container manager get heart beat from containers.
> > >>
> > >> And I guess the discussion is assuming at_least_once semantic? :)
> > >> at_most_once should have different recovery window.
> > >>
> > >> On Tue, Dec 15, 2015 at 12:01 PM, Timothy Farkas <[email protected]
> >
> > >> wrote:
> > >>
> > >> > Hi Ashwin,
> > >> >
> > >> > In your example, if A fails the recovery windows would be
> > >> >
> > >> > D - 15
> > >> > C - 15
> > >> > B - 15
> > >> > A - 15
> > >> >
> > >> > If C fails the recovery windows would be
> > >> >
> > >> > D -15
> > >> > C -15
> > >> > B - 25
> > >> > A - 30
> > >> >
> > >> > If every operator just reached window 30 and checkpointed, the
> > committed
> > >> > window would be 25, and all the checkpoints before window 30 would
> be
> > >> > purged, but the checkpoint for window 30 would not be purged.
> > >> >
> > >> > Thanks,
> > >> > Tim
> > >> >
> > >> > On Tue, Dec 15, 2015 at 11:41 AM, Ashwin Chandra Putta <
> > >> > [email protected]> wrote:
> > >> >
> > >> > > Tim,
> > >> > >
> > >> > > Thanks, that is pretty much inline with what I was thinking. A
> > little
> > >> > > different thought though in terms of picking the checkpoint based
> on
> > >> > > downstream operators. For A, is it not going to be "the checkpoint
> > >> with
> > >> > the
> > >> > > largest window id that is less than or equal to the checkpoint
> with
> > >> the
> > >> > > largest common window id (instead of largest window id) among all
> > the
> > >> > > operators down stream to A"
> > >> > >
> > >> > > For example,
> > >> > >
> > >> > > If A -> B -> C -> D is the dag. And say, the checkpoint window
> count
> > >> is 5
> > >> > > and the largest checkpoints are as follows.
> > >> > >
> > >> > > A - 30
> > >> > > B - 25
> > >> > > C - 20
> > >> > > D - 15
> > >> > >
> > >> > > Does A recover at 25 (checkpoint with largest window id) or 15
> > >> > (checkpoint
> > >> > > with largest common window id)?
> > >> > >
> > >> > > Also, regarding recovering at committed window id. Is it not
> > possible
> > >> in
> > >> > > the following scenario where all operators have checkpointed at 30
> > and
> > >> > got
> > >> > > the committed window call back. And then an operator fails before
> > any
> > >> > > operator checkpoints further. In that case, the recovery window is
> > 30
> > >> > > right?
> > >> > >
> > >> > > Regards,
> > >> > > Ashwin.
> > >> > >
> > >> > > On Mon, Dec 14, 2015 at 11:58 PM, Timothy Farkas <
> > [email protected]
> > >> >
> > >> > > wrote:
> > >> > >
> > >> > > > Hi Ashwin,
> > >> > > >
> > >> > > > The recovery checkpoint for operator A is computed by taking the
> > >> > > checkpoint
> > >> > > > with the largest window id that is less than or equal to the
> > >> checkpoint
> > >> > > > with the largest window id among all the operators down stream
> to
> > A.
> > >> > The
> > >> > > > output operators in a dag will always recover to their most
> recent
> > >> > > > checkpoint. The input operator of the dag may recover to the
> > >> earliest
> > >> > > > checkpoint. Operators between the input and ouput operators
> could
> > >> > recover
> > >> > > > to a window in between.
> > >> > > >
> > >> > > > I don't think you can ever recover to a committed window, the
> > >> earliest
> > >> > I
> > >> > > > think you can recover to is the window after the committed
> window
> > >> (may
> > >> > be
> > >> > > > wrong on this).
> > >> > > >
> > >> > > > On Mon, Dec 14, 2015 at 11:05 PM, Ashwin Chandra Putta <
> > >> > > > [email protected]> wrote:
> > >> > > >
> > >> > > > > In the apex architecture there is concept of checkpointing and
> > >> > concept
> > >> > > of
> > >> > > > > committed when all operator have crossed a common checkpoint.
> > >> > > > >
> > >> > > > > So, in which scenarios does a given operator recover at last
> > >> > checkpoint
> > >> > > > > window vs last committed window vs some other checkpoint
> window
> > in
> > >> > > > between?
> > >> > > > > --
> > >> > > > >
> > >> > > > > Regards,
> > >> > > > > Ashwin.
> > >> > > > >
> > >> > > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > >
> > >> > > Regards,
> > >> > > Ashwin.
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
>
>
> --
>
> Regards,
> Ashwin.
>

Reply via email to