Hey Stephan, Thanks for looking into it!
+1 for breaking this up, will do that. I can see your point, and maybe it makes sense to introduce part of scoping to incorporate support for nested loops (otherwise it can’t work). Let us think about this a bit. We will share another draft with a more detailed description of the approach you are suggesting asap.

> On 27 Oct 2016, at 10:55, Stephan Ewen <se...@apache.org> wrote:
>
> How about we break this up into two FLIPs? There are after all two
> orthogonal problems (termination, fault tolerance) with quite different
> discussion states.
>
> Concerning fault tolerance, I like the ideas.
> For the termination proposal, I would like to iterate a bit more.
>
> *Termination algorithm:*
>
> My main concern here is the introduction of a termination coordinator and
> any involvement of RPC messages when deciding termination.
> That would be such a fundamental break with the current runtime
> architecture, and it would make the currently very elegant and simple model
> much more complicated and harder to maintain. Given that Flink's runtime is
> complex enough, I would really like to avoid that.
>
> The current runtime paradigm coordinates between operators strictly via
> in-band events. RPC calls happen between operators and the master for
> triggering and acknowledging execution and checkpoints.
>
> I was wondering whether we can keep following that paradigm and still get
> most of what you are proposing here. In some sense, all we need to do is
> replace RPC calls with in-band events, and "decentralize" the coordinator
> such that every operator can make its own termination decision by itself.
>
> This is only a rough sketch, you probably need to flesh it out more.
>
> - I assume that the OP in the diagram knows that it is in a loop and that
> it is the one connected to the head and tail
>
> - When OP receives an EndOfStream Event from the regular source (RS), it
> emits an "AttemptTermination" event downstream to the operators involved in
> the loop.
> It attaches an attempt sequence number and memorizes that
> - Tail and Head forward these events
> - When OP receives the event back with the same attempt sequence number,
> and no records came in the meantime, it shuts down and emits EndOfStream
> downstream
> - When other records came back between emitting the AttemptTermination
> event and receiving it back, then it emits a new AttemptTermination event
> with the next sequence number.
> - This should terminate as soon as the loop is empty.
>
> Might this model even generalize to nested loops, where the
> "AttemptTermination" event is scoped by the loop's nesting level?
>
> Let me know what you think!
>
>
> Best,
> Stephan
>
>
> On Thu, Oct 27, 2016 at 10:19 AM, Stephan Ewen <se...@apache.org> wrote:
>
>> Hi!
>>
>> I am still scanning it and compiling some comments. Give me a bit ;-)
>>
>> Stephan
>>
>>
>> On Wed, Oct 26, 2016 at 6:13 PM, Paris Carbone <par...@kth.se> wrote:
>>
>>> Hey all,
>>>
>>> Now that many of you have already scanned the document (judging from the
>>> views) maybe it is time to give back some feedback!
>>> Did you like it? Would you suggest an improvement?
>>>
>>> I would suggest not to leave this in the void. It has to do with
>>> important properties that the system promises to provide.
>>> Fouad and I will do our best to answer your questions and discuss this
>>> further.
>>>
>>> cheers
>>> Paris
>>>
>>> On 21 Oct 2016, at 08:54, Paris Carbone <par...@kth.se> wrote:
>>>
>>> Hello everyone,
>>>
>>> Loops in Apache Flink have a good potential to become a much more
>>> powerful thing in future versions of Apache Flink.
>>> There is generally high demand to make them usable and first of all
>>> production-ready for upcoming releases.
>>>
>>> As a first commitment we would like to propose FLIP-13 for consistent
>>> processing with Loops.
>>> We are also working on scoped loops for Q1 2017 which we can share if
>>> there is enough interest.
>>>
>>> For now, that is an improvement proposal that solves two pending major
>>> issues:
>>>
>>> 1) The (not so trivial) problem of correct termination of jobs with
>>> iterations
>>> 2) The applicability of the checkpointing algorithm to iterative dataflow
>>> graphs.
>>>
>>> We would really appreciate it if you go through the linked draft
>>> (motivation and proposed changes) for FLIP-13 and point out comments,
>>> preferably publicly in this devlist discussion before we go ahead and
>>> update the wiki.
>>>
>>> https://docs.google.com/document/d/1M6ERj-TzlykMLHzPSwW5L9b0BhDbtoYucmByBjRBISs/edit?usp=sharing
>>>
>>> cheers
>>>
>>> Paris and Fouad
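
For reference, the in-band termination attempt that Stephan outlines in the quoted mail can be sketched roughly as follows. This is only a toy simulation, not Flink code: it assumes a single, non-nested loop, a FIFO feedback channel (so an AttemptTermination event travels behind any records still in flight), and a made-up loop body that feeds back value - 1 while value > 1. All class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy simulation of the "AttemptTermination" idea; none of these names are Flink APIs.
final class LoopTerminationSketch {

    // Hypothetical in-band messages travelling through the loop.
    static final class Rec {
        final int value;
        Rec(int value) { this.value = value; }
    }

    static final class AttemptTermination {
        final int seq;
        AttemptTermination(int seq) { this.seq = seq; }
    }

    // Assumed loop body: a record with value > 1 is fed back as value - 1.
    static void process(int value, Deque<Object> feedback) {
        if (value > 1) {
            feedback.add(new Rec(value - 1));
        }
    }

    // Returns the attempt sequence number at which OP decides to shut down.
    static int run(int[] recordsFromRegularSource) {
        // FIFO queue standing in for the tail -> head feedback channel.
        Deque<Object> feedback = new ArrayDeque<>();
        for (int v : recordsFromRegularSource) {
            process(v, feedback);
        }

        // EndOfStream arrived from the regular source: start the first attempt.
        int seq = 1;
        boolean sawRecordSinceAttempt = false;
        feedback.add(new AttemptTermination(seq));

        while (!feedback.isEmpty()) {
            Object e = feedback.poll();
            if (e instanceof Rec) {
                // A record came back while an attempt was in flight.
                sawRecordSinceAttempt = true;
                process(((Rec) e).value, feedback);
            } else if (e instanceof AttemptTermination
                    && ((AttemptTermination) e).seq == seq) {
                if (!sawRecordSinceAttempt) {
                    return seq; // loop is empty: shut down, emit EndOfStream downstream
                }
                // Records were seen: retry with the next sequence number.
                sawRecordSinceAttempt = false;
                seq++;
                feedback.add(new AttemptTermination(seq));
            }
        }
        throw new IllegalStateException("feedback channel drained without a decision");
    }
}
```

In this toy model, calling LoopTerminationSketch.run(new int[] {3}) needs three attempts, because a record comes back during each of the first two. An empty input terminates on the first attempt. Nested loops would need the event to carry a scope or nesting level as well, which is not modelled here.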