Hi all, just a remark about the Flink REST APIs (and their client as well): almost every time we need a way to dynamically know which jobs are contained in a jar file, and this could be exposed by the REST endpoint under /jars/:jarid/entry-points (a simple way to implement this would be to check the value of Main-class or Main-classes inside the Manifest of the jar, if it exists [1]).
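For illustration, here is a minimal sketch of the manifest check such an endpoint could perform, using only plain java.util.jar (this is not actual Flink code; the multi-job Main-classes attribute proposed in FLINK-10864 is not shown, only the standard Main-Class attribute):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class EntryPointProbe {

    /** Returns the Main-Class manifest attribute of the jar, or null if absent. */
    static String mainClassOf(File jar) throws Exception {
        try (JarFile jf = new JarFile(jar)) {
            Manifest mf = jf.getManifest();
            if (mf == null) {
                return null;
            }
            return mf.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        }
    }

    public static void main(String[] args) throws Exception {
        // Build a throwaway jar with a Main-Class entry, just to probe it.
        // "org.example.WordCount" is a made-up class name for the demo.
        Manifest mf = new Manifest();
        mf.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        mf.getMainAttributes().put(Attributes.Name.MAIN_CLASS, "org.example.WordCount");
        File jar = File.createTempFile("probe", ".jar");
        jar.deleteOnExit();
        try (JarOutputStream out = new JarOutputStream(new FileOutputStream(jar), mf)) {
            // No entries needed; the constructor already wrote META-INF/MANIFEST.MF.
        }

        String main = mainClassOf(jar);
        if (!"org.example.WordCount".equals(main)) {
            throw new AssertionError("unexpected main class: " + main);
        }
        System.out.println(main);
    }
}
```

The endpoint itself would do the read-only half of this (mainClassOf) against the uploaded jar and return the class name(s) as JSON.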
I understand that this is something that is not strictly required to execute Flink jobs, but IMHO it would ease A LOT the work of UI developers, who would have a way to show users all the available jobs inside a jar plus their configurable parameters. For example, right now in the WebUI you can upload a jar, but then you have to set the main class and its params (for example using a string like --param1 xx --param2 yy) without any autocomplete or UI support. Adding this functionality to the REST API and the respective client would enable the WebUI (and all UIs interacting with a Flink cluster) to prefill a dropdown list containing the entry-point classes (i.e. Flink jobs) and, once one is selected, their required (typed) parameters.

Best,
Flavio

[1] https://issues.apache.org/jira/browse/FLINK-10864

On Fri, Sep 27, 2019 at 9:16 AM Zili Chen <wander4...@gmail.com> wrote:

> modify
>
> /we just shutdown the cluster on the exit of the client running inside
> the cluster/
>
> to
>
> we just shutdown the cluster on both the exit of the client running
> inside the cluster and the finish of the job.
> Since the client is running inside the cluster, we can easily wait for
> both in ClusterEntrypoint.
>
>
> Zili Chen <wander4...@gmail.com> wrote on Fri, Sep 27, 2019 at 3:13 PM:
>
> > About JobCluster
> >
> > Actually I am not quite sure what we gain from the DETACHED
> > configuration on the cluster side.
> > We don't have a NON-DETACHED JobCluster in our codebase at the moment,
> > right?
> >
> > One major question comes to mind that we have to answer first:
> >
> > *What JobCluster conceptually is, exactly*
> >
> > Related discussion can be found in JIRA[1] and the mailing list[2].
> > Stephan gives a nice description of JobCluster:
> >
> > Two things to add:
> > - The job mode is very nice in the way that it runs the client inside
> > the cluster (in the same image/process that is the JM) and thus
> > unifies both applications and what the Spark world calls the "driver
> > mode".
> > - Another thing I would add is that during the FLIP-6 design, we were
> > thinking about setups where Dispatcher and JobManager are separate
> > processes. A Yarn or Mesos Dispatcher of a session could run
> > independently (even as a privileged process executing no code). Then
> > the "per-job" mode could still be helpful: when a job is submitted to
> > the dispatcher, it launches the JM again in per-job mode, so that JM
> > and TM processes are bound to the job only. For higher security
> > setups, it is important that processes are not reused across jobs.
> >
> > However, currently in "per-job" mode we generate the JobGraph on the
> > client side, launch the JobCluster and retrieve the JobGraph for
> > execution. So actually, we don't "run the client inside the cluster".
> >
> > Besides, referring to the discussion with Till[1], it would be helpful
> > if we followed the same process as session mode for "per-job" mode
> > from the user's perspective, i.e. we don't use
> > OptimizedPlanEnvironment to create the JobGraph, but directly deploy
> > the Flink cluster in env.execute.
> >
> > Generally, 2 points:
> >
> > 1. Run the Flink job by invoking the user main method and executing
> > throughout, instead of creating the JobGraph from the main class.
> > 2. Run the client inside the cluster.
> >
> > If 1 and 2 are implemented, there is obviously no need for a DETACHED
> > mode on the cluster side, because we just shutdown the cluster on the
> > exit of the client running inside the cluster. Whether or not the
> > result is delivered is up to user code.
> >
> > [1]
> > https://issues.apache.org/jira/browse/FLINK-14051?focusedCommentId=16931388&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16931388
> > [2]
> > https://lists.apache.org/x/thread.html/e8f14a381be6c027e8945f884c3cfcb309ce49c1ba557d3749fca495@%3Cdev.flink.apache.org%3E
> >
> >
> > Zili Chen <wander4...@gmail.com> wrote on Fri, Sep 27, 2019 at 2:13 PM:
> >
> >> Thanks for your replies Kostas & Aljoscha!
> >>
> >> Below are replies point by point.
> >>
> >> 1. For DETACHED mode, what I said there is about the DETACHED mode on
> >> the client side. Two configurations overload the term DETACHED[1].
> >>
> >> On the client side, it means whether or not client.submitJob blocks
> >> on the job execution result. Since client.submitJob returns
> >> CompletableFuture<JobClient>, NON-DETACHED has no power at all. The
> >> caller of submitJob decides whether or not to block to get the
> >> JobClient and request the job execution result. If the client
> >> crashes, it is a user-scope exception that should be handled in user
> >> code; if the client loses its connection to the cluster, we have
> >> retry count and interval configurations that retry automatically and
> >> throw a user-scope exception once exceeded.
> >>
> >> Your comment about polling for the result versus pushing the job
> >> result sounds like a concern on the cluster side.
> >>
> >> On the cluster side, DETACHED mode is alive only in JobCluster. If
> >> DETACHED is configured, the JobCluster exits when the job finishes;
> >> if NON-DETACHED is configured, the JobCluster exits once the job
> >> execution result is delivered. FLIP-74 doesn't propose changes in
> >> this scope; it remains as is.
> >>
> >> However, it is an interesting part and we can revisit this
> >> implementation a bit.
> >>
> >> <see the next email for a compact reply to this one>
> >>
> >> 2. The retrieval of JobClient is so important that if we don't have a
> >> way to retrieve a JobClient it is a dumb public user-facing
> >> interface (what a strange state :P).
> >>
> >> About the retrieval of JobClient, as mentioned in the document, two
> >> ways should be supported:
> >>
> >> (1) Retrieve it as the return value of job submission.
> >> (2) Retrieve a JobClient for an existing job (by job id).
> >>
> >> I highly respect your thoughts about how Executors should be and your
> >> thoughts on multi-layered clients.
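To make the client-side point concrete, here is a hedged sketch of why NON-DETACHED loses its meaning once submission is asynchronous: the same API serves both styles, and only the caller decides whether to block. JobClient, submitJob and the String result type below are simplified stand-ins, not the actual Flink interfaces:

```java
import java.util.concurrent.CompletableFuture;

// Simplified stand-in for the real Flink JobClient.
interface JobClient {
    CompletableFuture<String> getJobExecutionResult();
}

public class DetachedDemo {

    // Submission is always asynchronous; no DETACHED flag exists anywhere.
    // Here the future completes immediately, standing in for a real cluster.
    static CompletableFuture<JobClient> submitJob(String jobName) {
        return CompletableFuture.completedFuture(
                () -> CompletableFuture.completedFuture(jobName + ": DONE"));
    }

    public static void main(String[] args) throws Exception {
        // "Detached" caller: fire and forget, never joins the future.
        submitJob("wordcount");

        // "Attached" caller: the same API, but the caller chooses to block.
        String result = submitJob("wordcount")
                .thenCompose(JobClient::getJobExecutionResult)
                .get();
        System.out.println(result); // wordcount: DONE
    }
}
```

The blocking/non-blocking distinction is entirely in the calling code, which is exactly why a NON-DETACHED configuration value carries no information here.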
> >> Although (2) is not supported by public interfaces per the summary of
> >> the discussion above, we can discuss a bit the place of Executors
> >> among multi-layered clients and find a way to retrieve a JobClient
> >> for an existing job with the public client API. I will comment in the
> >> FLIP-73 thread[2] since it is almost entirely about Executors.
> >>
> >> Best,
> >> tison.
> >>
> >> [1]
> >> https://docs.google.com/document/d/1E-8UjOLz4QPUTxetGWbU23OlsIH9VIdodpTsxwoQTs0/edit?disco=AAAADnLLvM8
> >> [2]
> >> https://lists.apache.org/x/thread.html/dc3a541709f96906b43df4155373af1cd09e08c3f105b0bd0ba3fca2@%3Cdev.flink.apache.org%3E
> >>
> >>
> >> Kostas Kloudas <kklou...@gmail.com> wrote on Wed, Sep 25, 2019 at 9:29 PM:
> >>
> >>> Hi Tison,
> >>>
> >>> Thanks for the FLIP and for launching the discussion!
> >>>
> >>> As a first note, a big +1 on providing/exposing a JobClient to users!
> >>>
> >>> Some points that would be nice to clarify:
> >>> 1) You mention that we can get rid of the DETACHED mode: I agree
> >>> that at a high level, given that everything will now be
> >>> asynchronous, there is no need to keep the DETACHED mode, but I
> >>> think we should specify some aspects. For example, without the
> >>> explicit separation of the modes, what happens when the job
> >>> finishes? Does the client always periodically poll for the result,
> >>> or is the result pushed when in NON-DETACHED mode? What happens if
> >>> the client disconnects and reconnects?
> >>>
> >>> 2) On "how to retrieve a JobClient for a running job", I think this
> >>> is related to the other discussion you opened on the ML about
> >>> multi-layered clients. First of all, I agree that exposing different
> >>> "levels" of clients would be a nice addition, and actually there
> >>> have been some discussions about doing so in the future.
> >>> Now for this specific discussion:
> >>> i) I do not think that we should expose the
> >>> ClusterDescriptor/ClusterSpecification to the user, as this ties us
> >>> to a specific architecture which may change in the future.
> >>> ii) I do not think it should be the Executor that provides a
> >>> JobClient for an already running job (only for the jobs that it
> >>> submits). The job of the executor should just be to execute() a
> >>> pipeline.
> >>> iii) I think a solution that respects the separation of concerns
> >>> could be the addition of another component (in the future),
> >>> something like a ClientFactory or ClusterFactory, with methods like:
> >>> ClusterClient createCluster(Configuration), JobClient
> >>> retrieveJobClient(Configuration, JobId), and maybe even (although I
> >>> am not sure) Executor getExecutor(Configuration), and maybe more.
> >>> This component would be responsible for interacting with a cluster
> >>> manager like Yarn and doing what is now done by the
> >>> ClusterDescriptor, plus some more.
> >>>
> >>> Although under the hood all these abstractions (Environments,
> >>> Executors, ...) use the same clients, I believe their job/existence
> >>> is not contradictory; they simply hide some of the complexity from
> >>> the user and give us, as developers, some freedom to change some of
> >>> the parts in the future. For example, the executor will take a
> >>> Pipeline, create a JobGraph and submit it, instead of requiring the
> >>> user to do each step separately. This allows us to, for example, get
> >>> rid of the Plan if in the future everything is DataStream.
> >>> Essentially, I think of these as layers of an onion with the clients
> >>> close to the core. The higher you go, the more functionality is
> >>> included and hidden from the public eye.
> >>>
> >>> Point iii), by the way, is just a thought and by no means final.
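A hedged sketch of what point iii) could look like. Every type below is a hypothetical stand-in (the real Flink types have different shapes); the only idea being illustrated is that the factory is the single component that knows how to reach the cluster manager, and both Executors and JobClients are obtained through it rather than constructed by user code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the types mentioned above; illustrative only.
class Configuration { final Map<String, String> opts = new HashMap<>(); }
class JobId { final String id; JobId(String id) { this.id = id; } }
interface ClusterClient { String clusterId(); }
interface JobClient { JobId jobId(); }
interface Executor { JobClient execute(String pipeline); }

// The factory from point iii): the one place that talks to a cluster
// manager (Yarn, ...), replacing what ClusterDescriptor does today.
interface ClientFactory {
    ClusterClient createCluster(Configuration conf);
    JobClient retrieveJobClient(Configuration conf, JobId jobId);
    Executor getExecutor(Configuration conf);
}

public class FactorySketch {

    // Dummy in-memory factory, standing in for a Yarn/standalone one.
    static ClientFactory demoFactory() {
        return new ClientFactory() {
            public ClusterClient createCluster(Configuration conf) {
                return () -> "cluster-1";
            }
            public JobClient retrieveJobClient(Configuration conf, JobId jobId) {
                return () -> jobId;
            }
            public Executor getExecutor(Configuration conf) {
                return pipeline -> () -> new JobId("job-for-" + pipeline);
            }
        };
    }

    public static void main(String[] args) {
        // Attach to a job that is already running, knowing only its id --
        // the capability the Executor itself should not have to provide.
        JobClient attached =
                demoFactory().retrieveJobClient(new Configuration(), new JobId("abc"));
        System.out.println(attached.jobId().id); // abc
    }
}
```

Note how execute() on the Executor also yields a JobClient, so case (1) from the earlier email (JobClient as the return of submission) and case (2) (retrieval by id) live at different layers of the onion.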
> >>> I also like the idea of multi-layered clients, so this may spark up
> >>> the discussion.
> >>>
> >>> Cheers,
> >>> Kostas
> >>>
> >>> On Wed, Sep 25, 2019 at 2:21 PM Aljoscha Krettek <aljos...@apache.org>
> >>> wrote:
> >>> >
> >>> > Hi Tison,
> >>> >
> >>> > Thanks for proposing the document! I had some comments on it.
> >>> >
> >>> > I think the only complex thing that we still need to figure out is
> >>> > how to get a JobClient for a job that is already running, as you
> >>> > mentioned in the document. Currently I'm thinking that it's ok to
> >>> > add a method to Executor for retrieving a JobClient for a running
> >>> > job by providing an ID. Let's see what Kostas has to say on the
> >>> > topic.
> >>> >
> >>> > Best,
> >>> > Aljoscha
> >>> >
> >>> > > On 25. Sep 2019, at 12:31, Zili Chen <wander4...@gmail.com> wrote:
> >>> > >
> >>> > > Hi all,
> >>> > >
> >>> > > Summarizing the discussion about introducing a Flink JobClient
> >>> > > API[1], we drafted FLIP-74[2] to gather thoughts and move
> >>> > > towards standard public user-facing interfaces.
> >>> > >
> >>> > > This discussion thread aims at standardizing the job-level
> >>> > > client API. But I'd like to emphasize that how to retrieve a
> >>> > > JobClient may trigger further discussion on the different levels
> >>> > > of clients exposed by Flink, so a follow-up thread will be
> >>> > > started later to coordinate FLIP-73 and FLIP-74 on the exposure
> >>> > > issue.
> >>> > >
> >>> > > Looking forward to your opinions.
> >>> > >
> >>> > > Best,
> >>> > > tison.
> >>> > >
> >>> > > [1]
> >>> > > https://lists.apache.org/thread.html/ce99cba4a10b9dc40eb729d39910f315ae41d80ec74f09a356c73938@%3Cdev.flink.apache.org%3E
> >>> > > [2]
> >>> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-74%3A+Flink+JobClient+API
> >>>
> >>