The design looks great: it supports very diverse deployment modes, allows 
for heterogeneous TMs, and promotes job isolation.

Some feedback:

*Dispatcher*
The dispatcher concept here expands nicely on what was introduced in the Mesos 
design doc (FLINK-1984). The most significant difference is the job-centric 
orientation of the dispatcher API. FLIP-6 seems to eliminate the concept of a 
session (or defines it simply as the lifecycle of a JM); is that correct?
Do you agree that I should revise the Mesos dispatcher design to be job-centric?

I'll be taking the first crack at implementing the dispatcher (for Mesos only) 
in FLINK-1984 (T2). I'll keep FLIP-6 in mind as I go.

The dispatcher's backend behavior will vary significantly across Mesos, 
standalone, and other modes. Presumably a base class with concrete 
implementations will be introduced. To echo the FLIP-6 design as I understand it:

1) Standalone
   a) The dispatcher process starts an RM, a dispatcher frontend, and a "local" 
dispatcher backend at startup.
   b) Upon job submission, the local dispatcher backend creates an in-process 
JM actor for the job.
   c) The JM allocates slots as normal. The RM draws from its pool of 
registered TMs, which grows and shrinks only due to external events.

2) Mesos
   a) The dispatcher process starts a dispatcher frontend and a "Mesos" 
dispatcher backend at startup.
   b) Upon job submission, the Mesos dispatcher backend creates a Mesos task 
(dubbed an "AppMaster") which contains a JM/RM for the job.
   c) The system otherwise functions as described in the Mesos design doc.
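To make the base-class idea concrete, here is a rough Java sketch of the two backends behind one contract. All names (DispatcherBackend, LocalDispatcherBackend, etc.) are placeholders of mine, not identifiers from FLIP-6:

```java
import java.util.UUID;

// Common contract both backends implement.
interface DispatcherBackend {
    /** Accepts a job and returns an identifier for tracking it. */
    String submitJob(String jobGraphName);
}

/** Standalone: spawns an in-process JobManager per submitted job. */
class LocalDispatcherBackend implements DispatcherBackend {
    @Override
    public String submitJob(String jobGraphName) {
        String jobId = UUID.randomUUID().toString();
        // here: create a JM actor in this JVM; it requests slots from the
        // shared RM, whose TM pool changes only via external events
        return jobId;
    }
}

/** Mesos: launches a Mesos task ("AppMaster") holding a per-job JM/RM. */
class MesosDispatcherBackend implements DispatcherBackend {
    @Override
    public String submitJob(String jobGraphName) {
        String jobId = UUID.randomUUID().toString();
        // here: build a TaskInfo for the AppMaster and launch it via the
        // Mesos scheduler driver
        return jobId;
    }
}

public class DispatcherSketch {
    public static void main(String[] args) {
        DispatcherBackend backend = new LocalDispatcherBackend();
        System.out.println("submitted job " + backend.submitJob("wordcount"));
    }
}
```

The dispatcher frontend would only see the interface; which concrete backend gets wired in is a deployment-mode decision.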

*Client*
I'm concerned about the two code paths the client uses to launch a job 
(with-dispatcher vs. without-dispatcher). Maybe they could be unified by saying 
that the client always calls the dispatcher, and that the dispatcher is 
hostable either in the client or in a separate process. The only variance 
would be the client-to-dispatcher transport (local vs. HTTP).
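A rough sketch of what that unification might look like; the gateway names and the string-returning stubs are placeholders of mine, not proposed APIs:

```java
// The client always talks to a dispatcher gateway; only the transport
// behind the gateway differs between deployment modes.
interface DispatcherGateway {
    String submitJob(String jobGraphName);
}

/** In-process transport: the dispatcher is hosted inside the client. */
class LocalGateway implements DispatcherGateway {
    @Override
    public String submitJob(String jobGraphName) {
        // direct method call into the embedded dispatcher
        return "local:" + jobGraphName;
    }
}

/** Remote transport: the dispatcher runs as a separate process. */
class HttpGateway implements DispatcherGateway {
    private final String endpoint;

    HttpGateway(String endpoint) { this.endpoint = endpoint; }

    @Override
    public String submitJob(String jobGraphName) {
        // here: POST the serialized job graph to the dispatcher's REST API
        return endpoint + "/" + jobGraphName;
    }
}

public class ClientSketch {
    public static void main(String[] args) {
        // The client-side submission code is identical either way.
        DispatcherGateway[] gateways = {
            new LocalGateway(), new HttpGateway("http://dispatcher:8081")
        };
        for (DispatcherGateway gw : gateways) {
            System.out.println(gw.submitJob("wordcount"));
        }
    }
}
```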

*RM*
On the issue of RM statefulness, we can say that the RM does not persist slot 
allocations (the ground truth is in the TMs), but may persist other information 
(related to cluster-manager interaction). For example, the Mesos RM persists 
the assigned framework identifier and per-task planning information (as is 
highly recommended by the Mesos development guide).
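In sketch form, the split between persisted and non-persisted RM state might look like this; the class and field names are illustrative, not from the FLIP or the Mesos integration code:

```java
import java.util.HashMap;
import java.util.Map;

class MesosResourceManagerState {
    // Persisted: the framework id assigned by the Mesos master, so a
    // restarted RM can re-register as the same framework.
    String frameworkId;

    // Persisted: per-task planning information, so task launches that were
    // in flight at crash time can be reconciled during recovery.
    final Map<String, String> plannedTasks = new HashMap<>();

    // NOT persisted: slot allocations. The ground truth lives in the
    // TaskManagers and is rebuilt as they re-register with the new RM.
    final Map<String, String> slotAllocations = new HashMap<>();
}

public class RmStateSketch {
    public static void main(String[] args) {
        MesosResourceManagerState state = new MesosResourceManagerState();
        state.frameworkId = "framework-0001";
        state.plannedTasks.put("taskmanager-1", "cpus=2 mem=4096");
        System.out.println("would persist: " + state.frameworkId
                + " " + state.plannedTasks);
    }
}
```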

On RM fencing, I was already wondering whether to add it to the Mesos RM, so it 
is nice to see it introduced more generally. My rationale is that the 
dispatcher cannot guarantee that only a single RM is running, because orphaned 
tasks are possible in certain Mesos failure situations. Similarly, I'm unsure 
whether YARN provides a strong guarantee about the AM.
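A minimal sketch of the fencing idea, assuming a token per leadership epoch: components remember the current leader's token and drop messages carrying a stale one, which neutralizes an orphaned RM. The names are my own, not FLIP-6 identifiers:

```java
import java.util.UUID;

class FencedComponent {
    private UUID currentLeaderToken;

    /** Called when a new RM wins leadership (e.g. via ZooKeeper). */
    void grantLeadership(UUID token) {
        this.currentLeaderToken = token;
    }

    /** A message from an orphaned RM carries an old token and is dropped. */
    boolean accept(UUID senderToken) {
        return senderToken != null && senderToken.equals(currentLeaderToken);
    }
}

public class FencingSketch {
    public static void main(String[] args) {
        FencedComponent tm = new FencedComponent();
        UUID epoch1 = UUID.randomUUID();
        tm.grantLeadership(epoch1);
        System.out.println("epoch1 accepted: " + tm.accept(epoch1));

        // A new RM takes over; the orphaned RM's token is now stale.
        UUID epoch2 = UUID.randomUUID();
        tm.grantLeadership(epoch2);
        System.out.println("stale epoch1 accepted: " + tm.accept(epoch1));
    }
}
```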

*User Code*
Having job code on the system classpath seems possible in only a subset of 
cases, and handling the variability may be complex. How important is this 
optimization?

*Security Implications*
It should be noted that the standalone mode doesn't offer isolation between 
jobs. The whole system will have a single security context (as it does now).

Meanwhile, the 'high-trust' nature of the dispatcher in other scenarios is 
rightly emphasized. The fact that user code shouldn't run in the dispatcher 
process (except in standalone mode) must be kept in mind. The design doc of 
FLINK-3929 (section C2) has more detail on that.


-Eron


> On Jul 28, 2016, at 2:22 AM, Maximilian Michels <m...@apache.org> wrote:
> 
> Hi Stephan,
> 
> Thanks for the nice wrap-up of ideas and discussions we had over the
> last months (not all on the mailing list though because we were just
> getting started with the FLIP process). The document is very
> comprehensive and explains the changes in great details, even up to
> the message passing level.
> 
> What I really like about the FLIP is that we delegate multi-tenancy
> away from the JobManager to the resource management framework and the
> dispatchers. This will help to make the JobManager component cleaner
> and simpler. The prospect of having the user jars directly in the
> system classpath of the workers, instead of dealing with custom class
> loaders, is very nice.
> 
> The model we have for acquiring and releasing resources wouldn't work
> particularly well with all the new deployment options, so +1 on a new
> task slot request/offer system and +1 for making the ResourceManager
> responsible for TaskManager registration and slot management. This is
> well aligned with the initial idea of the ResourceManager component.
> 
> We definitely need good testing for these changes since the
> possibility of bugs increases with the additional number of messages
> introduced.
> 
> The only thing that bugs me is whether we make the Standalone mode a
> bit less nice to use. The initial bootstrapping of the nodes via the
> local dispatchers and the subsequent registration of TaskManagers and
> allocation of slots could cause some delay. It's not a major concern
> though because it will take little time compared to the actual job run
> time (unless you run a tiny WordCount).
> 
> Cheers,
> Max
> 
> 
> 
> 
> On Fri, Jul 22, 2016 at 9:26 PM, Stephan Ewen <se...@apache.org> wrote:
>> Hi all!
>> 
>> Here comes a pretty big FLIP: "Improvements to the Flink Deployment and
>> Process Model", to better support Yarn, Mesos, Kubernetes, and whatever
>> else Google, Elon Musk, and all the other folks will think up next.
>> 
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077
>> 
>> It is a pretty big FLIP where I took input and thoughts from many people,
>> like Till, Max, Xiaowei (and his colleagues), Eron, and others.
>> 
>> The core ideas revolve around
>>  - making the JobManager in its core a per-job component (handle
>> multi-tenancy outside the JobManager)
>>  - making resource acquisition and release more dynamic
>>  - tying deployments more naturally to jobs where desirable
>> 
>> 
>> Let's get the discussion started...
>> 
>> Greetings,
>> Stephan
