Hi Eron! As per our separate discussion, the main thing here is a confusion about what a session is.

I would keep the initial API of the dispatcher to just "launchJob(JobGraph, artifacts)" (sketched below). It would simply start individual jobs in the cluster. If we want to make the dispatcher "resource session" aware, we need to decide between two options, in my opinion:

  - The dispatcher simply launches the session's core process (ResourceManager and SessionJobManager) and returns the session's endpoint info. The SessionExecutionEnvironment then uses a client that communicates directly with the SessionJobManager and sends it new jobs.

  - The dispatcher is the gateway for all interactions with that session, meaning it forwards the calls from the SessionExecutionEnvironment to the SessionJobManager.
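As a rough illustration of that minimal, job-centric API, the dispatcher interface could look something like the sketch below (the names are placeholders, not an agreed interface):

    import java.nio.file.Path;
    import java.util.Collection;
    import java.util.concurrent.CompletableFuture;

    /** Placeholder for Flink's job graph type (the compiled dataflow plus its configuration). */
    interface JobGraph {}

    /**
     * Sketch of the minimal, job-centric dispatcher API: it only starts individual jobs.
     * All names are illustrative, not an agreed interface.
     */
    interface Dispatcher {

        /**
         * Starts an individual job in the cluster. The artifacts are the user jars and
         * files that have to be shipped along with the job graph.
         */
        CompletableFuture<Void> launchJob(JobGraph jobGraph, Collection<Path> artifacts);

        // If the dispatcher is later made "resource session" aware, either
        //   (a) a method that starts the session's core process and returns its endpoint info, or
        //   (b) gateway methods that forward session calls to the SessionJobManager
        // would be added here.
    }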
My gut feeling is that the session handling is sufficiently separate from the "dispatch/start an individual job" functionality that it would not hurt to focus only on the latter in the first iteration.

Greetings,
Stephan

On Fri, Aug 5, 2016 at 6:19 PM, Wright, Eron <ewri...@live.com> wrote:
>
> Let me rephrase my comment on the dispatcher. I mean that its API would be job-centric, i.e. with operations like `execute(jobspec)` rather than operations like `createSession` that the status quo would suggest.
>
> Since writing those comments I’ve put more time into developing the Mesos dispatcher with FLIP-6 in mind. I see that Till is spinning up an effort too, so we should all sync up in the near future.
>
> Eron
>
> On Aug 5, 2016, at 7:30 AM, Stephan Ewen <se...@apache.org> wrote:
> >
> > Hi Eron!
> >
> > Some comments on your comments:
> >
> > *Dispatcher*
> > - The dispatcher should NOT be job-centric. The dispatcher should take over the "multi job" responsibilities here, now that the JobManager is single-job only.
> > - An abstract dispatcher would be great. It could implement the connection/HTTP elements and have an abstract method to start a job (see the sketch below):
> >      -> Yarn - use YarnClusterClient to start a YarnJob
> >      -> Mesos - same thing
> >      -> Standalone - spawn a JobManager
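To make that layering concrete, a possible shape (continuing the sketch further up; all class names are illustrative and the bodies are stubs):

    /**
     * Sketch of the abstract dispatcher: the base class owns the generic submission path
     * (connection/HTTP handling, artifact staging), while subclasses only decide how a
     * JobManager for the job is brought up. Types are the ones from the earlier sketch.
     */
    abstract class AbstractDispatcher implements Dispatcher {

        @Override
        public final CompletableFuture<Void> launchJob(JobGraph jobGraph, Collection<Path> artifacts) {
            // generic part: accept the submission, stage the artifacts, do the bookkeeping, ...
            return startJob(jobGraph, artifacts);
        }

        /** Cluster-specific way of starting a dedicated JobManager for this job. */
        protected abstract CompletableFuture<Void> startJob(JobGraph jobGraph, Collection<Path> artifacts);
    }

    /** Yarn: hand the job to the Yarn cluster client, which brings up a per-job application. */
    class YarnDispatcher extends AbstractDispatcher {
        @Override
        protected CompletableFuture<Void> startJob(JobGraph jobGraph, Collection<Path> artifacts) {
            return CompletableFuture.runAsync(() -> { /* delegate to the Yarn cluster client */ });
        }
    }

    /** Mesos: same idea, launching a Mesos task that hosts the job's JobManager. */
    class MesosDispatcher extends AbstractDispatcher {
        @Override
        protected CompletableFuture<Void> startJob(JobGraph jobGraph, Collection<Path> artifacts) {
            return CompletableFuture.runAsync(() -> { /* launch the per-job "AppMaster" task */ });
        }
    }

    /** Standalone: spawn a JobManager for the job, in process or next to the dispatcher. */
    class StandaloneDispatcher extends AbstractDispatcher {
        @Override
        protected CompletableFuture<Void> startJob(JobGraph jobGraph, Collection<Path> artifacts) {
            return CompletableFuture.runAsync(() -> { /* start a local JobManager */ });
        }
    }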
> > *Client*
> > This is an interesting point. Max is currently refactoring the clients into
> >   - a Cluster Client (with specializations for Yarn and Standalone) to launch jobs and control a cluster (yarn session, ...)
> >   - a Job Client, which is connected to a single job and can issue commands to that job (cancel/stop/checkpoint/savepoint/change-parallelism)
> > Let's try and get his input on this.
> >
> > *RM*
> > Agreed - the base RM is "stateless"; specialized RMs can behave differently if they need to.
> > RM fencing must be generic - all cluster types can suffer from orphaned tasks (Yarn as well, I think).
> >
> > *User Code*
> > I think in the cases where processes/containers are launched per-job, this should always be feasible. It is a nice optimization that I think we should do wherever possible. It makes users' lives with respect to classloading much easier.
> > Some cases with custom class loading are currently tough in Flink - that way, those jobs would at least run in the yarn/mesos individual-job mode (not the session mode, though - that one still needs dynamic class loading).
> >
> > *Standalone Security*
> > That is a known limitation and fine for now, I think. Whoever wants proper security needs to go to Yarn/Mesos initially. Standalone v2.0 may change that.
> >
> > Greetings,
> > Stephan
> >
> > On Sat, Jul 30, 2016 at 12:26 AM, Wright, Eron <ewri...@live.com> wrote:
> >
> >> The design looks great - it solves for very diverse deployment modes, allows for heterogeneous TMs, and promotes job isolation.
> >>
> >> Some feedback:
> >>
> >> *Dispatcher*
> >> The dispatcher concept here expands nicely on what was introduced in the Mesos design doc (MESOS-1984), the most significant difference being the job-centric orientation of the dispatcher API. FLIP-6 seems to eliminate the concept of a session (or defines it simply as the lifecycle of a JM); is that correct? Do you agree I should revise the Mesos dispatcher design to be job-centric?
> >>
> >> I'll be taking the first crack at implementing the dispatcher (for Mesos only) in MESOS-1984 (T2). I’ll keep FLIP-6 in mind as I go.
> >>
> >> The dispatcher's backend behavior will vary significantly for Mesos vs standalone vs others. Presumably a base class with concrete implementations will be introduced. To echo the FLIP-6 design as I understand it:
> >>
> >> 1) Standalone
> >>    a) The dispatcher process starts an RM, a dispatcher frontend, and a "local" dispatcher backend at startup.
> >>    b) Upon job submission, the local dispatcher backend creates an in-process JM actor for the job.
> >>    c) The JM allocates slots as normal. The RM draws from its pool of registered TMs, which grows and shrinks due (only) to external events.
> >>
> >> 2) Mesos
> >>    a) The dispatcher process starts a dispatcher frontend and a "Mesos" dispatcher backend at startup.
> >>    b) Upon job submission, the Mesos dispatcher backend creates a Mesos task (dubbed an "AppMaster") which contains a JM/RM for the job.
> >>    c) The system otherwise functions as described in the Mesos design doc.
> >>
> >> *Client*
> >> I'm concerned about the two code paths that the client uses to launch a job (with-dispatcher vs without-dispatcher). Maybe it could be unified by saying that the client always calls the dispatcher, and that the dispatcher is hostable in either the client or a separate process. The only variance would be the client-to-dispatcher transport (local vs HTTP).
> >>
> >> *RM*
> >> On the issue of RM statefulness, we can say that the RM does not persist slot allocation (the ground truth is in the TM), but may persist other information (related to cluster manager interaction). For example, the Mesos RM persists the assigned framework identifier and per-task planning information (as is highly recommended by the Mesos development guide).
> >>
> >> On RM fencing, I was already wondering whether to add it to the Mesos RM, so it is nice to see it being introduced more generally. My rationale is that the dispatcher cannot guarantee that only a single RM is running, because orphaned tasks are possible in certain Mesos failure situations. Similarly, I’m unsure whether YARN provides a strong guarantee about the AM.
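A minimal sketch of such generic fencing, assuming the fencing token is simply a UUID granted on leader election (all names are invented for illustration):

    import java.util.UUID;

    /**
     * Sketch of generic RM fencing: every call from a ResourceManager carries the fencing
     * token it was granted on becoming leader, and the TaskManager ignores calls whose
     * token does not match the leader it currently knows about.
     */
    class FencedTaskManagerEndpoint {

        /** Fencing token of the ResourceManager this TaskManager currently considers the leader. */
        private volatile UUID currentResourceManagerToken;

        /** Called when the TaskManager (re-)registers at a newly elected ResourceManager. */
        void acceptNewResourceManager(UUID fencingToken) {
            this.currentResourceManagerToken = fencingToken;
        }

        /** Example of a fenced call: slot requests from a stale or orphaned RM are dropped. */
        void requestSlot(UUID senderToken, int slotIndex) {
            if (!senderToken.equals(currentResourceManagerToken)) {
                return; // stale RM, e.g. an orphaned process left over from a failed failover
            }
            // ... offer the slot to the JobManager as usual ...
        }
    }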
> >> *User Code*
> >> Having job code on the system classpath seems possible in only a subset of cases. The variability may be complex. How important is this optimization?
> >>
> >> *Security Implications*
> >> It should be noted that the standalone embodiment doesn't offer isolation between jobs. The whole system will have a single security context (as it does now).
> >>
> >> Meanwhile, the ‘high-trust’ nature of the dispatcher in other scenarios is rightly emphasized. The fact that user code shouldn't be run in the dispatcher process (except in standalone) must be kept in mind. The design doc of FLINK-3929 (section C2) has more detail on that.
> >>
> >> -Eron
> >>
> >>> On Jul 28, 2016, at 2:22 AM, Maximilian Michels <m...@apache.org> wrote:
> >>>
> >>> Hi Stephan,
> >>>
> >>> Thanks for the nice wrap-up of the ideas and discussions we had over the last months (not all on the mailing list, though, because we were just getting started with the FLIP process). The document is very comprehensive and explains the changes in great detail, even up to the message passing level.
> >>>
> >>> What I really like about the FLIP is that we delegate multi-tenancy away from the JobManager to the resource management framework and the dispatchers. This will help to make the JobManager component cleaner and simpler. The prospect of having the user jars directly in the system classpath of the workers, instead of dealing with custom class loaders, is very nice.
> >>>
> >>> The model we have for acquiring and releasing resources wouldn't work particularly well with all the new deployment options, so +1 on a new task slot request/offer system and +1 for making the ResourceManager responsible for TaskManager registration and slot management. This is well aligned with the initial idea of the ResourceManager component.
> >>>
> >>> We definitely need good testing for these changes, since the possibility of bugs increases with the number of additional messages introduced.
> >>>
> >>> The only thing that bugs me is whether we make the Standalone mode a bit less nice to use. The initial bootstrapping of the nodes via the local dispatchers and the subsequent registration of TaskManagers and allocation of slots could cause some delay. It's not a major concern, though, because it will take little time compared to the actual job run time (unless you run a tiny WordCount).
> >>>
> >>> Cheers,
> >>> Max
> >>>
> >>> On Fri, Jul 22, 2016 at 9:26 PM, Stephan Ewen <se...@apache.org> wrote:
> >>>> Hi all!
> >>>>
> >>>> Here comes a pretty big FLIP: "Improvements to the Flink Deployment and Process Model", to better support Yarn, Mesos, Kubernetes, and whatever else Google, Elon Musk, and all the other folks will think up next.
> >>>>
> >>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077
> >>>>
> >>>> It is a pretty big FLIP where I took input and thoughts from many people, like Till, Max, Xiaowei (and his colleagues), Eron, and others.
> >>>>
> >>>> The core ideas revolve around:
> >>>>   - making the JobManager in its core a per-job component (handling multi-tenancy outside the JobManager)
> >>>>   - making resource acquisition and release more dynamic
> >>>>   - tying deployments more naturally to jobs where desirable
> >>>>
> >>>> Let's get the discussion started...
> >>>>
> >>>> Greetings,
> >>>> Stephan
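As a small illustration of the "making resource acquisition and release more dynamic" idea and the slot request/offer system mentioned above, the messages involved could look roughly like this (message and field names are invented for the sketch, not taken from the FLIP):

    /** JobManager -> ResourceManager: please provide a slot for the given job. */
    final class RequestSlot {
        final String jobId;
        final int cpuCores;
        final int memoryMb;

        RequestSlot(String jobId, int cpuCores, int memoryMb) {
            this.jobId = jobId;
            this.cpuCores = cpuCores;
            this.memoryMb = memoryMb;
        }
    }

    /** ResourceManager -> TaskManager: tell the TM to offer one of its slots to that job's JM. */
    final class AllocateSlot {
        final String jobId;
        final int slotIndex;
        final String jobManagerAddress;

        AllocateSlot(String jobId, int slotIndex, String jobManagerAddress) {
            this.jobId = jobId;
            this.slotIndex = slotIndex;
            this.jobManagerAddress = jobManagerAddress;
        }
    }

    /** TaskManager -> JobManager: the actual slot offer; the TM stays the ground truth for allocations. */
    final class OfferSlot {
        final String taskManagerId;
        final int slotIndex;

        OfferSlot(String taskManagerId, int slotIndex) {
            this.taskManagerId = taskManagerId;
            this.slotIndex = slotIndex;
        }
    }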