Let me rephrase my comment on the dispatcher. I mean that its API would be job-centric, i.e., with operations like `execute(jobspec)` rather than operations like `createSession` that the status quo would suggest.
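To make the distinction concrete, here is a minimal sketch of the two API shapes. Everything in it is hypothetical and for illustration only: the interface names, the `JobSpec` fields, the `submit` method, and the in-memory implementation are assumptions, not Flink or FLIP-6 APIs. Only the operation names `execute` and `createSession` come from the comment above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class DispatcherApiSketch {

    /** Hypothetical job description; a real job spec would carry far more. */
    public static final class JobSpec {
        public final String name;
        public final int parallelism;

        public JobSpec(String name, int parallelism) {
            this.name = name;
            this.parallelism = parallelism;
        }
    }

    /** Job-centric shape: the unit of work is a job, and each call launches one. */
    public interface JobCentricDispatcher {
        String execute(JobSpec spec); // returns an identifier for the launched job
    }

    /** Session-centric shape: a session is created first, then jobs attach to it. */
    public interface SessionCentricDispatcher {
        String createSession();
        String submit(String sessionId, JobSpec spec); // hypothetical companion call
    }

    /** In-memory stand-in that only records which jobs were launched. */
    public static final class InMemoryJobDispatcher implements JobCentricDispatcher {
        public final Map<String, JobSpec> running = new HashMap<>();

        @Override
        public String execute(JobSpec spec) {
            String jobId = UUID.randomUUID().toString();
            running.put(jobId, spec);
            return jobId;
        }
    }

    public static void main(String[] args) {
        InMemoryJobDispatcher dispatcher = new InMemoryJobDispatcher();
        String jobId = dispatcher.execute(new JobSpec("wordcount", 4));
        System.out.println("launched " + dispatcher.running.get(jobId).name + " as " + jobId);
    }
}
```

With the job-centric shape, the caller never manages a session lifecycle; whatever hosts the job is an implementation detail of `execute`.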
Since writing those comments, I’ve put more time into developing the Mesos dispatcher with FLIP-6 in mind. I see that Till is spinning up an effort too, so we should all sync up in the near future.

Eron

> On Aug 5, 2016, at 7:30 AM, Stephan Ewen <se...@apache.org> wrote:
>
> Hi Eron!
>
> Some comments on your comments:
>
> *Dispatcher*
> - The dispatcher should NOT be job-centric. The dispatcher should take
>   over the "multi job" responsibilities here, now that the JobManager is
>   single-job only.
> - An abstract dispatcher would be great. It could implement the
>   connection/HTTP elements and have an abstract method to start a job
>     -> Yarn - use YarnClusterClient to start a YarnJob
>     -> Mesos - same thing
>     -> Standalone - spawn a JobManager
>
> *Client*
> This is an interesting point. Max is currently refactoring the clients into
> - Cluster Client (with specialization for Yarn, Standalone) to launch
>   jobs and control a cluster (yarn session, ...)
> - Job Client, which is connected to a single job and can issue commands
>   to that job (cancel/stop/checkpoint/savepoint/change-parallelism)
>
> Let's try and get his input on this.
>
> *RM*
> Agreed - the base RM is "stateless", specialized RMs can behave differently,
> if they need to.
> RM fencing must be generic - all cluster types can suffer from orphaned
> tasks (Yarn as well, I think)
>
> *User Code*
> I think in the cases where processes/containers are launched per-job, this
> should always be feasible. It is a nice optimization that I think we should
> do wherever possible. Makes users' life with respect to classloading much
> easier.
> Some cases with custom class loading are currently tough in Flink - that
> way, these jobs would at least run in the yarn/mesos individual job mode
> (not the session mode still, that one needs dynamic class loading).
>
> *Standalone Security*
> That is a known limitation and fine for now, I think. Whoever wants proper
> security needs to go to Yarn/Mesos initially.
> Standalone v2.0 may change that.
>
> Greetings,
> Stephan
>
>
> On Sat, Jul 30, 2016 at 12:26 AM, Wright, Eron <ewri...@live.com> wrote:
>
>> The design looks great - it solves for very diverse deployment modes,
>> allows for heterogeneous TMs, and promotes job isolation.
>>
>> Some feedback:
>>
>> *Dispatcher*
>> The dispatcher concept here expands nicely on what was introduced in the
>> Mesos design doc (MESOS-1984). The most significant difference is the
>> job-centric orientation of the dispatcher API. FLIP-6 seems to eliminate
>> the concept of a session (or, defines it simply as the lifecycle of a JM);
>> is that correct? Do you agree I should revise the Mesos dispatcher
>> design to be job-centric?
>>
>> I'll be taking the first crack at implementing the dispatcher (for Mesos
>> only) in MESOS-1984 (T2). I’ll keep FLIP-6 in mind as I go.
>>
>> The dispatcher's backend behavior will vary significantly for Mesos vs
>> standalone vs others. Assumedly a base class with concrete
>> implementations will be introduced. To echo the FLIP-6 design as I
>> understand it:
>>
>> 1) Standalone
>>    a) The dispatcher process starts an RM, dispatcher frontend, and
>>       "local" dispatcher backend at startup.
>>    b) Upon job submission, the local dispatcher backend creates an
>>       in-process JM actor for the job.
>>    c) The JM allocates slots as normal. The RM draws from its pool of
>>       registered TMs, which grows and shrinks due (only) to external events.
>>
>> 2) Mesos
>>    a) The dispatcher process starts a dispatcher frontend and "Mesos"
>>       dispatcher backend at startup.
>>    b) Upon job submission, the Mesos dispatcher backend creates a Mesos
>>       task (dubbed an "AppMaster") which contains a JM/RM for the job.
>>    c) The system otherwise functions as described in the Mesos design doc.
>>
>> *Client*
>> I'm concerned about the two code paths that the client uses to launch a
>> job (with-dispatcher vs without-dispatcher).
>> Maybe it could be unified by
>> saying that the client always calls the dispatcher, and that the dispatcher
>> is hostable in either the client or in a separate process. The only
>> variance would be the client-to-dispatcher transport (local vs HTTP).
>>
>> *RM*
>> On the issue of RM statefulness, we can say that the RM does not persist
>> slot allocation (the ground truth is in the TM), but may persist other
>> information (related to cluster manager interaction). For example, the
>> Mesos RM persists the assigned framework identifier and per-task planning
>> information (as is highly recommended by the Mesos development guide).
>>
>> On RM fencing, I was already wondering whether to add it to the Mesos RM,
>> so it is nice to see it being introduced more generally. My rationale is,
>> the dispatcher cannot guarantee that only a single RM is running, because
>> orphaned tasks are possible in certain Mesos failure situations.
>> Similarly, I’m unsure whether YARN provides a strong guarantee about the
>> AM.
>>
>> *User Code*
>> Having job code on the system classpath seems possible in only a subset of
>> cases. The variability may be complex. How important is this
>> optimization?
>>
>> *Security Implications*
>> It should be noted that the standalone embodiment doesn't offer isolation
>> between jobs. The whole system will have a single security context (as it
>> does now).
>>
>> Meanwhile, the ‘high-trust’ nature of the dispatcher in other scenarios is
>> rightly emphasized. The fact that user code shouldn't be run in the
>> dispatcher process (except in standalone) must be kept in mind. The
>> design doc of FLINK-3929 (section C2) has more detail on that.
>>
>>
>> -Eron
>>
>>
>>> On Jul 28, 2016, at 2:22 AM, Maximilian Michels <m...@apache.org> wrote:
>>>
>>> Hi Stephan,
>>>
>>> Thanks for the nice wrap-up of ideas and discussions we had over the
>>> last months (not all on the mailing list though because we were just
>>> getting started with the FLIP process).
>>> The document is very
>>> comprehensive and explains the changes in great detail, even up to
>>> the message passing level.
>>>
>>> What I really like about the FLIP is that we delegate multi-tenancy
>>> away from the JobManager to the resource management framework and the
>>> dispatchers. This will help to make the JobManager component cleaner
>>> and simpler. The prospect of having the user jars directly in the
>>> system classpath of the workers, instead of dealing with custom class
>>> loaders, is very nice.
>>>
>>> The model we have for acquiring and releasing resources wouldn't work
>>> particularly well with all the new deployment options, so +1 on a new
>>> task slot request/offer system and +1 for making the ResourceManager
>>> responsible for TaskManager registration and slot management. This is
>>> well aligned with the initial idea of the ResourceManager component.
>>>
>>> We definitely need good testing for these changes since the
>>> possibility of bugs increases with the additional number of messages
>>> introduced.
>>>
>>> The only thing that bugs me is whether we make the Standalone mode a
>>> bit less nice to use. The initial bootstrapping of the nodes via the
>>> local dispatchers and the subsequent registration of TaskManagers and
>>> allocation of slots could cause some delay. It's not a major concern
>>> though because it will take little time compared to the actual job run
>>> time (unless you run a tiny WordCount).
>>>
>>> Cheers,
>>> Max
>>>
>>>
>>> On Fri, Jul 22, 2016 at 9:26 PM, Stephan Ewen <se...@apache.org> wrote:
>>>> Hi all!
>>>>
>>>> Here comes a pretty big FLIP: "Improvements to the Flink Deployment and
>>>> Process Model", to better support Yarn, Mesos, Kubernetes, and whatever
>>>> else Google, Elon Musk, and all the other folks will think up next.
>>>>
>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077
>>>>
>>>> It is a pretty big FLIP where I took input and thoughts from many people,
>>>> like Till, Max, Xiaowei (and his colleagues), Eron, and others.
>>>>
>>>> The core ideas revolve around
>>>> - making the JobManager in its core a per-job component (handle multi
>>>>   tenancy outside the JobManager)
>>>> - making resource acquisition and release more dynamic
>>>> - tying deployments more naturally to jobs where desirable
>>>>
>>>>
>>>> Let's get the discussion started...
>>>>
>>>> Greetings,
>>>> Stephan