Hey Stephan,

Thanks for looking into it!

+1 for breaking this up, will do that.

I can see your point and maybe it makes sense to introduce part of scoping to 
incorporate support for nested loops (otherwise it can’t work).
Let us think about this a bit. We will share another draft for a more detail 
description of the approach you are suggesting asap.


> On 27 Oct 2016, at 10:55, Stephan Ewen <se...@apache.org> wrote:
> 
> How about we break this up into two FLIPs? There are after all two
> orthogonal problems (termination, fault tolerance) with quite different
> discussion states.
> 
> Concerning fault tolerance, I like the ideas.
> For the termination proposal, I would like to iterate a bit more.
> 
> *Termination algorithm:*
> 
> My main concern here is the introduction of a termination coordinator and
> any involvement of RPC messages when deciding termination.
> That would be such a fundamental break with the current runtime
> architecture, and it would make the currently very elegant and simple model
> much more complicated and harder to maintain. Given that Flink's runtime is
> complex enough, I would really like to avoid that.
> 
> The current runtime paradigm coordinates between operators strictly via
> in-band events. RPC calls happen between operators and the master for
> triggering and acknowledging execution and checkpoints.
> 
> I was wondering whether we can keep following that paradigm and still get
> most of what you are proposing here. In some sense, all we need to do is
> replace RPC calls with in-band events, and "decentralize" the coordinator
> such that every operator can make its own termination decision by itself.
> 
> This is only a rough sketch, you probably need to flesh it out more.
> 
>  - I assume that the OP in the diagram knows that it is in a loop and that
> it is the one connected to the head and tail
> 
>  - When OP receives and EndOfStream Event from the regular source (RS), it
> emits an "AttemptTermination" event downstream to the operators involved in
> the loop. It attaches an attempt sequence number and memorizes that
>  - Tail and Head forward these events
>  - When OP receives the event back with the same attempt sequence number,
> and no records came in the meantime, it shuts down and emits EndOfStream
> downstream
>  - When other records came back between emitting the AttemptTermination
> event and receiving it back, then it emits a new AttemptTermination event
> with the next sequence number.
>  - This should terminate as soon as the loop is empty.
> 
> Might this model even generalize to nested loops, where the
> "AttemptTermination" event is scoped by the loop's nesting level?
> 
> Let me know what you think!
> 
> 
> Best,
> Stephan
> 
> 
> On Thu, Oct 27, 2016 at 10:19 AM, Stephan Ewen <se...@apache.org> wrote:
> 
>> Hi!
>> 
>> I am still scanning it and compiling some comments. Give me a bit ;-)
>> 
>> Stephan
>> 
>> 
>> On Wed, Oct 26, 2016 at 6:13 PM, Paris Carbone <par...@kth.se> wrote:
>> 
>>> Hey all,
>>> 
>>> Now that many of you have already scanned the document (judging from the
>>> views) maybe it is time to give back some feedback!
>>> Did you like it? Would you suggest an improvement?
>>> 
>>> I would suggest not to leave this in the void. It has to do with
>>> important properties that the system promises to provide.
>>> Me and Fouad will do our best to answer your questions and discuss this
>>> further.
>>> 
>>> cheers
>>> Paris
>>> 
>>> On 21 Oct 2016, at 08:54, Paris Carbone <par...@kth.se<mailto:parisc@k
>>> th.se>> wrote:
>>> 
>>> Hello everyone,
>>> 
>>> Loops in Apache Flink have a good potential to become a much more
>>> powerful thing in future version of Apache Flink.
>>> There is generally high demand to make them usable and first of all
>>> production-ready for upcoming releases.
>>> 
>>> As a first commitment we would like to propose FLIP-13 for consistent
>>> processing with Loops.
>>> We are also working on scoped loops for Q1 2017 which we can share if
>>> there is enough interest.
>>> 
>>> For now, that is an improvement proposal that solves two pending major
>>> issues:
>>> 
>>> 1) The (not so trivial) problem of correct termination of jobs with
>>> iterations
>>> 2) The applicability of the checkpointing algorithm to iterative dataflow
>>> graphs.
>>> 
>>> We would really appreciate it if you go through the linked draft
>>> (motivation and proposed changes) for FLIP-13 and point out comments,
>>> preferably publicly in this devlist discussion before we go ahead and
>>> update the wiki.
>>> 
>>> https://docs.google.com/document/d/1M6ERj-TzlykMLHzPSwW5L9b0
>>> BhDbtoYucmByBjRBISs/edit?usp=sharing
>>> 
>>> cheers
>>> 
>>> Paris and Fouad
>>> 
>>> 
>>> 
>> 

Reply via email to