Siyuan,

Yes, we are discussing at-least-once semantics.

Tim,

So it is indeed possible that we recover at the committed window id in a
case where we just committed and there were no further checkpoints before
failure.

Regards,
Ashwin.

On Tue, Dec 15, 2015 at 1:54 PM, Timothy Farkas <[email protected]> wrote:

> Whoops my bad, that would never happen. There is a check that only allows
> purging of checkpoints for an operator if the operator has more than one
> checkpoint. :)
>
> On Tue, Dec 15, 2015 at 1:39 PM, Timothy Farkas <[email protected]>
> wrote:
>
> > Siyuan, then Ashwin may be right that there is an issue. Looking at the
> > code again I think this could happen:
> >
> > 1 - All operators reach checkpoint 30
> > 2 - Checkpoints are updated on heartbeat and committed window is now 25,
> >     everything before window 30 is purged
> > 3 - no new checkpoint is reached for any operator
> > 4 - Checkpoints are updated on heartbeat again and committed window is
> >     now 30, now window 30 is purged.
> >
> > May be missing something again though.
> >
> > On Tue, Dec 15, 2015 at 1:32 PM, Siyuan Hua <[email protected]>
> > wrote:
> >
> >> My understanding is the committed window could possibly be 30 as well,
> >> depends on whether container manager get heart beat from containers.
> >>
> >> And I guess the discussion is assuming at_least_once semantic? :)
> >> at_most_once should have different recovery window.
> >>
> >> On Tue, Dec 15, 2015 at 12:01 PM, Timothy Farkas <[email protected]>
> >> wrote:
> >>
> >> > Hi Ashwin,
> >> >
> >> > In your example, if A fails the recovery windows would be
> >> >
> >> > D - 15
> >> > C - 15
> >> > B - 15
> >> > A - 15
> >> >
> >> > If C fails the recovery windows would be
> >> >
> >> > D - 15
> >> > C - 15
> >> > B - 25
> >> > A - 30
> >> >
> >> > If every operator just reached window 30 and checkpointed, the
> >> > committed window would be 25, and all the checkpoints before window
> >> > 30 would be purged, but the checkpoint for window 30 would not be
> >> > purged.
> >> >
> >> > Thanks,
> >> > Tim
> >> >
> >> > On Tue, Dec 15, 2015 at 11:41 AM, Ashwin Chandra Putta
> >> > <[email protected]> wrote:
> >> >
> >> > > Tim,
> >> > >
> >> > > Thanks, that is pretty much in line with what I was thinking. A
> >> > > little different thought though in terms of picking the checkpoint
> >> > > based on downstream operators. For A, is it not going to be "the
> >> > > checkpoint with the largest window id that is less than or equal to
> >> > > the checkpoint with the largest common window id (instead of largest
> >> > > window id) among all the operators down stream to A"
> >> > >
> >> > > For example,
> >> > >
> >> > > If A -> B -> C -> D is the dag. And say, the checkpoint window count
> >> > > is 5 and the largest checkpoints are as follows.
> >> > >
> >> > > A - 30
> >> > > B - 25
> >> > > C - 20
> >> > > D - 15
> >> > >
> >> > > Does A recover at 25 (checkpoint with largest window id) or 15
> >> > > (checkpoint with largest common window id)?
> >> > >
> >> > > Also, regarding recovering at committed window id. Is it not
> >> > > possible in the following scenario where all operators have
> >> > > checkpointed at 30 and got the committed window call back. And then
> >> > > an operator fails before any operator checkpoints further. In that
> >> > > case, the recovery window is 30 right?
> >> > >
> >> > > Regards,
> >> > > Ashwin.
> >> > >
> >> > > On Mon, Dec 14, 2015 at 11:58 PM, Timothy Farkas
> >> > > <[email protected]> wrote:
> >> > >
> >> > > > Hi Ashwin,
> >> > > >
> >> > > > The recovery checkpoint for operator A is computed by taking the
> >> > > > checkpoint with the largest window id that is less than or equal
> >> > > > to the checkpoint with the largest window id among all the
> >> > > > operators down stream to A. The output operators in a dag will
> >> > > > always recover to their most recent checkpoint. The input operator
> >> > > > of the dag may recover to the earliest checkpoint. Operators
> >> > > > between the input and output operators could recover to a window
> >> > > > in between.
> >> > > >
> >> > > > I don't think you can ever recover to a committed window, the
> >> > > > earliest I think you can recover to is the window after the
> >> > > > committed window (may be wrong on this).
> >> > > >
> >> > > > On Mon, Dec 14, 2015 at 11:05 PM, Ashwin Chandra Putta
> >> > > > <[email protected]> wrote:
> >> > > >
> >> > > > > In the apex architecture there is concept of checkpointing and
> >> > > > > concept of committed when all operator have crossed a common
> >> > > > > checkpoint.
> >> > > > >
> >> > > > > So, in which scenarios does a given operator recover at last
> >> > > > > checkpoint window vs last committed window vs some other
> >> > > > > checkpoint window in between?
> >> > > > >
> >> > > > > --
> >> > > > >
> >> > > > > Regards,
> >> > > > > Ashwin.
> >> > >
> >> > > --
> >> > >
> >> > > Regards,
> >> > > Ashwin.

--
Regards,
Ashwin.
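
For anyone following along, here is a rough Python sketch of the recovery rule as Tim describes it in this thread, applied to the A -> B -> C -> D example. It is only an illustration under stated assumptions, not Apex's actual code: the function name `recovery_windows`, the linear-chain restriction, and the retained checkpoint lists (everything from window 15 onward, spaced 5 apart) are made up for the example. Operators upstream of the failure are simply reported at their most recent checkpoint, as in Tim's "if C fails" listing.

```python
def recovery_windows(chain, checkpoints, failed):
    """Sketch of the recovery rule discussed in the thread (hypothetical helper).

    chain:       operator names ordered upstream to downstream (linear DAG assumed).
    checkpoints: dict mapping operator name to its retained checkpoint window ids.
    failed:      name of the operator that failed.
    """
    start = chain.index(failed)
    result = {}

    # Operators upstream of the failure are reported at their latest checkpoint.
    for name in chain[:start]:
        result[name] = max(checkpoints[name])

    # Walk from the output operator back up to the failed operator. Each
    # operator picks its largest checkpoint that does not exceed the window
    # chosen by the operator directly downstream of it.
    limit = None
    for name in reversed(chain[start:]):
        candidates = [w for w in checkpoints[name] if limit is None or w <= limit]
        if not candidates:
            raise ValueError(f"no usable checkpoint for operator {name}")
        result[name] = max(candidates)
        limit = result[name]
    return result


# Example data from the thread; the retained lists are an assumption.
chain = ["A", "B", "C", "D"]
checkpoints = {
    "A": [15, 20, 25, 30],
    "B": [15, 20, 25],
    "C": [15, 20],
    "D": [15],
}

if __name__ == "__main__":
    print("A fails:", recovery_windows(chain, checkpoints, "A"))
    print("C fails:", recovery_windows(chain, checkpoints, "C"))
```

With this data the sketch gives 15 for every operator when A fails, and D - 15, C - 15, B - 25, A - 30 when C fails, which matches the windows Tim lists.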
