Re: [DISCUSS] FLIP-74: Flink JobClient API

Aljoscha Krettek Fri, 27 Sep 2019 01:48:53 -0700

Hi Flavio,

I agree that this would be good to have, but I think it is outside the
scope of FLIP-74; it is an orthogonal feature.


Best,
Aljoscha

> On 27. Sep 2019, at 10:31, Flavio Pompermaier <pomperma...@okkam.it> wrote:
> 
> Hi all,
> just a remark about the Flink REST APIs (and their client as well): almost
> always we need a way to dynamically know which jobs are contained in
> a jar file, and this could be exposed by the REST endpoint under
> /jars/:jarid/entry-points (a simple way to implement this would be to check
> the value of Main-class or Main-classes inside the Manifest of the jar if
> they exist [1]).
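
A minimal sketch of the manifest check described above, using only the JDK's
java.util.jar classes. Main-Class is the standard manifest attribute; the
Main-classes attribute is taken from the proposal and is assumed here to hold
a comma-separated list. The class name JarEntryPoints and its methods are
hypothetical and not part of Flink or its REST API.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class JarEntryPoints {

    /** Returns the entry-point class names declared in the jar's manifest (possibly empty). */
    public static List<String> entryPoints(Path jar) throws IOException {
        List<String> result = new ArrayList<>();
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            Manifest manifest = jarFile.getManifest(); // null when the jar has no manifest
            if (manifest == null) {
                return result;
            }
            Attributes attrs = manifest.getMainAttributes();
            // Standard attribute: a single class name.
            addNames(result, attrs.getValue(Attributes.Name.MAIN_CLASS));
            // Attribute from the proposal; the comma-separated format is an assumption.
            addNames(result, attrs.getValue("Main-classes"));
        }
        return result;
    }

    /** Splits a comma-separated value and appends each new, non-empty name. */
    private static void addNames(List<String> target, String value) {
        if (value == null) {
            return;
        }
        for (String name : value.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty() && !target.contains(trimmed)) {
                target.add(trimmed);
            }
        }
    }

    /** Writes a temporary jar that contains only a manifest, for demonstration. */
    public static Path writeDemoJar(String mainClass, String mainClasses) throws IOException {
        Manifest manifest = new Manifest();
        Attributes attrs = manifest.getMainAttributes();
        // Without Manifest-Version the main attributes are not written out.
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        if (mainClass != null) {
            attrs.put(Attributes.Name.MAIN_CLASS, mainClass);
        }
        if (mainClasses != null) {
            attrs.putValue("Main-classes", mainClasses);
        }
        Path jar = Files.createTempFile("entry-points-demo", ".jar");
        try (OutputStream out = Files.newOutputStream(jar);
             JarOutputStream jarOut = new JarOutputStream(out, manifest)) {
            // No further entries; closing the stream finishes the archive.
        }
        return jar;
    }

    public static void main(String[] args) throws IOException {
        Path jar = writeDemoJar("org.example.JobA", "org.example.JobA, org.example.JobB");
        System.out.println(entryPoints(jar));
        Files.deleteIfExists(jar);
    }
}
```

A REST handler for /jars/:jarid/entry-points could return such a list; the
typed parameters mentioned in the proposal would need additional metadata
that a plain manifest does not carry.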
> 
> I understand that this is not strictly required to execute Flink jobs,
> but IMHO it would ease the work of UI developers A LOT: they would have a
> way to show users all available jobs inside a jar plus their configurable
> parameters.
> For example, right now in the WebUI, you can upload a jar and then you have
> to set (without any autocomplete or UI support) the main class and its
> params (for example using a string like --param1 xx --param2 yy).
> Adding this functionality to the REST API and the respective client would
> enable the WebUI (and all UIs interacting with a Flink cluster) to prefill
> a dropdown list containing the entry-point classes (i.e. Flink
> jobs) and, once one is selected, its required (typed) parameters.
> 
> Best,
> Flavio
> 
> [1] https://issues.apache.org/jira/browse/FLINK-10864
> 
> On Fri, Sep 27, 2019 at 9:16 AM Zili Chen <wander4...@gmail.com> wrote:
> 
>> Correction to my previous email: change
>> 
>> /we just shutdown the cluster on the exit of client that running inside
>> cluster/
>> 
>> to
>> 
>> we just shut down the cluster once both the client running inside the
>> cluster has exited and the job has finished.
>> Since the client is running inside the cluster, we can easily wait for
>> both events in ClusterEntrypoint.
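
Waiting for both events can be sketched with plain CompletableFutures. This
is only an illustration under assumptions: the class ShutdownSketch and the
two futures are hypothetical stand-ins, and nothing here reflects how
ClusterEntrypoint is actually implemented.

```java
import java.util.concurrent.CompletableFuture;

public class ShutdownSketch {

    /**
     * Returns a future that completes only after both inputs have completed,
     * i.e. the in-cluster client has exited and the job has finished.
     */
    public static CompletableFuture<Void> shutdownTrigger(
            CompletableFuture<Void> clientExited, CompletableFuture<Void> jobFinished) {
        return CompletableFuture.allOf(clientExited, jobFinished);
    }

    public static void main(String[] args) {
        CompletableFuture<Void> clientExited = new CompletableFuture<>();
        CompletableFuture<Void> jobFinished = new CompletableFuture<>();
        CompletableFuture<Void> trigger = shutdownTrigger(clientExited, jobFinished);

        // The shutdown action is registered once and runs when both events happened.
        trigger.thenRun(() -> System.out.println("shutting down cluster"));

        jobFinished.complete(null);
        System.out.println("job finished, trigger done: " + trigger.isDone());
        clientExited.complete(null);
        System.out.println("client exited, trigger done: " + trigger.isDone());
    }
}
```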
>> 
>> 
>> On Fri, Sep 27, 2019 at 3:13 PM Zili Chen <wander4...@gmail.com> wrote:
>> 
>>> About JobCluster
>>> 
>>> Actually, I am not quite sure what we gain from the DETACHED configuration
>>> on the cluster side.
>>> We don't in fact have a NON-DETACHED JobCluster in our codebase, right?
>>> 
>>> This brings me to one major question we have to answer first.
>>> 
>>> *What exactly is JobCluster, conceptually?*
>>> 
>>> Related discussion can be found in JIRA[1] and on the mailing list[2].
>>> Stephan gives a nice description of JobCluster:
>>> 
>>> Two things to add:
>>> - The job mode is very nice in the way that it runs the client inside
>>> the cluster (in the same image/process that is the JM) and thus unifies
>>> both applications and what the Spark world calls the "driver mode".
>>> - Another thing I would add is that during the FLIP-6 design, we were
>>> thinking about setups where Dispatcher and JobManager are separate
>>> processes. A Yarn or Mesos Dispatcher of a session could run
>>> independently (even as privileged processes executing no code). Then the
>>> "per-job" mode could still be helpful: when a job is submitted to the
>>> dispatcher, it launches the JM again in a per-job mode, so that JM and TM
>>> processes are bound to the job only. For higher security setups, it is
>>> important that processes are not reused across jobs.
>>> 
>>> However, currently in "per-job" mode we generate the JobGraph on the
>>> client side, launch the JobCluster, and have it retrieve the JobGraph for
>>> execution. So actually, we don't "run the client inside the cluster".
>>> 
>>> Besides, referring to the discussion with Till[1], it would be helpful if
>>> "per-job" mode followed the same process as session mode from the user's
>>> perspective: we would not use OptimizedPlanEnvironment to create the
>>> JobGraph, but directly deploy the Flink cluster in env.execute.
>>> 
>>> In general, there are 2 points:
>>> 
>>> 1. Run the Flink job by invoking the user main method and executing it
>>> throughout, instead of creating a JobGraph from the main class.
>>> 2. Run the client inside the cluster.
>>> 
>>> If 1 and 2 are implemented, there is obviously no need for DETACHED mode
>>> on the cluster side, because we just shut down the cluster on the exit of
>>> the client that is running inside the cluster. Whether or not the result
>>> is delivered is up to user code.
>>> 
>>> [1] https://issues.apache.org/jira/browse/FLINK-14051?focusedCommentId=16931388&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16931388
>>> [2] https://lists.apache.org/x/thread.html/e8f14a381be6c027e8945f884c3cfcb309ce49c1ba557d3749fca495@%3Cdev.flink.apache.org%3E
>>> 
>>> 
>>> On Fri, Sep 27, 2019 at 2:13 PM Zili Chen <wander4...@gmail.com> wrote:
>>> 
>>>> Thanks for your replies Kostas & Aljoscha!
>>>> 
>>>> Below are replies point by point.
>>>> 
>>>> 1. About DETACHED mode: what I said there concerns DETACHED mode on the
>>>> client side.
>>>> Two configurations overload the term DETACHED[1].
>>>> 
>>>> On the client side, it means whether or not client.submitJob blocks
>>>> until the job execution result is available.
>>>> Since client.submitJob returns CompletableFuture<JobClient>, NON-DETACHED
>>>> has no effect at all. The caller of submitJob decides whether or not to
>>>> block to get the JobClient and request the job execution result. If the
>>>> client crashes, it is a user-scope exception that should be handled in
>>>> user code; if the client loses its connection to the cluster, we have a
>>>> retry count and interval configuration that retries automatically and
>>>> throws a user-scope exception if the limit is exceeded.
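
The point that the caller decides about blocking can be illustrated with a
small, self-contained sketch. Everything in it is a hypothetical stand-in:
the nested JobClient interface, submitJob, and the result string are
invented for illustration and are not Flink's actual API.

```java
import java.util.concurrent.CompletableFuture;

public class DetachedSketch {

    /** Hypothetical stand-in for the proposed JobClient; not Flink's interface. */
    public interface JobClient {
        String jobId();

        CompletableFuture<String> jobResult();
    }

    /** Hypothetical submission call: returns at once with a future of the JobClient. */
    public static CompletableFuture<JobClient> submitJob(String jobName) {
        return CompletableFuture.supplyAsync(() -> new JobClient() {
            @Override
            public String jobId() {
                return jobName + "-1";
            }

            @Override
            public CompletableFuture<String> jobResult() {
                // Simulates a job that finishes asynchronously.
                return CompletableFuture.supplyAsync(() -> "FINISHED");
            }
        });
    }

    /** "Detached" usage: wait only for the submission, never ask for the result. */
    public static String submitDetached(String jobName) {
        return submitJob(jobName).thenApply(JobClient::jobId).join();
    }

    /** "Non-detached" usage: the caller chooses to block until the result arrives. */
    public static String submitAndWait(String jobName) {
        return submitJob(jobName).thenCompose(JobClient::jobResult).join();
    }

    public static void main(String[] args) {
        System.out.println("detached, job id: " + submitDetached("wordcount"));
        System.out.println("attached, result: " + submitAndWait("wordcount"));
    }
}
```

Both usages go through the same submitJob call, so under this design a
separate client-side DETACHED switch carries no extra information.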
>>>> 
>>>> Your comment about polling for the job result sounds like a concern on
>>>> the cluster side.
>>>> 
>>>> On the cluster side, DETACHED mode exists only in JobCluster. If
>>>> DETACHED is configured, JobCluster exits when the job finishes; if
>>>> NON-DETACHED is configured, JobCluster exits when the job execution
>>>> result has been delivered. FLIP-74 does not propose changes in this
>>>> area; it is left as is.
>>>> 
>>>> However, it is an interesting part, and we can revisit this
>>>> implementation a bit.
>>>> 
>>>> <see the next email for compact reply in this one>
>>>> 
>>>> 2. The retrieval of JobClient is so important that, if we don't have a
>>>> way to retrieve JobClient, it is a dumb public user-facing interface
>>>> (what a strange state :P).
>>>> 
>>>> About the retrieval of JobClient, as mentioned in the document, two ways
>>>> should be supported.
>>>> 
>>>> (1). Retrieve it as the return type of job submission.
>>>> (2). Retrieve a JobClient of an existing job (with job id).
>>>> 
>>>> I highly respect your thoughts about how Executors should be and your
>>>> thoughts on multi-layered clients.
>>>> Although (2) is not supported by public interfaces according to the
>>>> summary of the discussion above, we can discuss a bit the place of
>>>> Executors in multi-layered clients and find a way to retrieve the
>>>> JobClient of an existing job with a public client API. I will comment in
>>>> the FLIP-73 thread[2] since it is mostly about Executors.
>>>> 
>>>> Best,
>>>> tison.
>>>> 
>>>> [1] https://docs.google.com/document/d/1E-8UjOLz4QPUTxetGWbU23OlsIH9VIdodpTsxwoQTs0/edit?disco=AAAADnLLvM8
>>>> [2] https://lists.apache.org/x/thread.html/dc3a541709f96906b43df4155373af1cd09e08c3f105b0bd0ba3fca2@%3Cdev.flink.apache.org%3E
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Wed, Sep 25, 2019 at 9:29 PM Kostas Kloudas <kklou...@gmail.com> wrote:
>>>> 
>>>>> Hi Tison,
>>>>> 
>>>>> Thanks for the FLIP and launching the discussion!
>>>>> 
>>>>> As a first note, big +1 on providing/exposing a JobClient to the users!
>>>>> 
>>>>> Some points that would be nice to clarify:
>>>>> 1) You mention that we can get rid of the DETACHED mode: I agree that
>>>>> at a high level, given that everything will now be asynchronous, there
>>>>> is no need to keep the DETACHED mode but I think we should specify
>>>>> some aspects. For example, without the explicit separation of the
>>>>> modes, what happens when the job finishes? Does the client always
>>>>> periodically poll for the result, or is the result pushed when
>>>>> in NON-DETACHED mode? What happens if the client disconnects and
>>>>> reconnects?
>>>>> 
>>>>> 2) On the "how to retrieve a JobClient for a running Job", I think
>>>>> this is related to the other discussion you opened in the ML about
>>>>> multi-layered clients. First of all, I agree that exposing different
>>>>> "levels" of clients would be a nice addition, and actually there have
>>>>> been some discussions about doing so in the future. Now for this
>>>>> specific discussion:
>>>>>      i) I do not think that we should expose the
>>>>> ClusterDescriptor/ClusterSpecification to the user, as this ties us to
>>>>> a specific architecture which may change in the future.
>>>>>     ii) I do not think it should be the Executor that will provide a
>>>>> JobClient for an already running job (only for the Jobs that it
>>>>> submits). The job of the executor should just be to execute() a
>>>>> pipeline.
>>>>>     iii) I think a solution that respects the separation of concerns
>>>>> could be the addition of another component (in the future), something
>>>>> like a ClientFactory, or ClusterFactory that will have methods like:
>>>>> ClusterClient createCluster(Configuration), JobClient
>>>>> retrieveJobClient(Configuration , JobId), maybe even (although not
>>>>> sure) Executor getExecutor(Configuration ) and maybe more. This
>>>>> component would be responsible to interact with a cluster manager like
>>>>> Yarn and do what is now being done by the ClusterDescriptor plus some
>>>>> more stuff.
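
The component suggested in iii) could look roughly like the sketch below.
The three method names follow the email; all types are empty placeholders,
the job id is simplified to a String, and InMemoryClientFactory exists only
to exercise the interface. None of this is an existing Flink API.

```java
import java.util.HashMap;
import java.util.Map;

public class ClientFactorySketch {

    // Placeholder types standing in for the Flink classes named in the thread.
    public interface Configuration {}

    public interface ClusterClient {}

    public interface Executor {}

    public interface JobClient {
        String jobId();
    }

    /** Hypothetical factory; method names follow the email, the rest is assumed. */
    public interface ClientFactory {
        ClusterClient createCluster(Configuration configuration);

        JobClient retrieveJobClient(Configuration configuration, String jobId);

        Executor getExecutor(Configuration configuration);
    }

    /** Toy in-memory implementation, used only to show how callers would use it. */
    public static class InMemoryClientFactory implements ClientFactory {
        private final Map<String, JobClient> runningJobs = new HashMap<>();

        /** Records a job as running so that it can be retrieved later. */
        public void register(String jobId) {
            runningJobs.put(jobId, () -> jobId);
        }

        @Override
        public ClusterClient createCluster(Configuration configuration) {
            return new ClusterClient() {};
        }

        @Override
        public JobClient retrieveJobClient(Configuration configuration, String jobId) {
            JobClient client = runningJobs.get(jobId);
            if (client == null) {
                throw new IllegalArgumentException("Unknown job: " + jobId);
            }
            return client;
        }

        @Override
        public Executor getExecutor(Configuration configuration) {
            return new Executor() {};
        }
    }

    public static void main(String[] args) {
        InMemoryClientFactory factory = new InMemoryClientFactory();
        factory.register("job-42");
        Configuration configuration = new Configuration() {};
        JobClient client = factory.retrieveJobClient(configuration, "job-42");
        System.out.println("retrieved client for " + client.jobId());
    }
}
```

Keeping retrieval in a separate factory leaves the Executor with the single
job of executing a pipeline, which is the separation argued for in ii).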
>>>>> 
>>>>> Although all these abstractions (Environments, Executors, ...) use the
>>>>> same clients under the hood, I believe their jobs/existence are not
>>>>> contradictory; they simply hide some of the complexity from the user
>>>>> and give us, as developers, some freedom to change some of the parts
>>>>> in the future. For example, the executor will
>>>>> take a Pipeline, create a JobGraph and submit it, instead of requiring
>>>>> the user to do each step separately. This allows us to, for example,
>>>>> get rid of the Plan if in the future everything is DataStream.
>>>>> Essentially, I think of these as layers of an onion with the clients
>>>>> being close to the core. The higher you go, the more functionality is
>>>>> included and hidden from the public eye.
>>>>> 
>>>>> Point iii) by the way is just a thought and by no means final. I also
>>>>> like the idea of multi-layered clients so this may spark up the
>>>>> discussion.
>>>>> 
>>>>> Cheers,
>>>>> Kostas
>>>>> 
>>>>> On Wed, Sep 25, 2019 at 2:21 PM Aljoscha Krettek <aljos...@apache.org>
>>>>> wrote:
>>>>>> 
>>>>>> Hi Tison,
>>>>>> 
>>>>>> Thanks for proposing the document! I had some comments on the document.
>>>>>> 
>>>>>> I think the only complex thing that we still need to figure out is how
>>>>>> to get a JobClient for a job that is already running, as you mentioned
>>>>>> in the document. Currently I’m thinking that it’s ok to add a method to
>>>>>> Executor for retrieving a JobClient for a running job by providing an
>>>>>> ID. Let’s see what Kostas has to say on the topic.
>>>>>> 
>>>>>> Best,
>>>>>> Aljoscha
>>>>>> 
>>>>>>> On 25. Sep 2019, at 12:31, Zili Chen <wander4...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> As a summary of the discussion about introducing the Flink JobClient
>>>>>>> API[1], we drafted FLIP-74[2] to gather thoughts and move towards a
>>>>>>> standard public user-facing interface.
>>>>>>> 
>>>>>>> This discussion thread aims at standardizing the job-level client API.
>>>>>>> But I'd like to emphasize that how to retrieve a JobClient may lead to
>>>>>>> further discussion on the different levels of clients exposed by
>>>>>>> Flink, so a follow-up thread will be started later to coordinate
>>>>>>> FLIP-73 and FLIP-74 on the exposure issue.
>>>>>>> 
>>>>>>> Looking forward to your opinions.
>>>>>>> 
>>>>>>> Best,
>>>>>>> tison.
>>>>>>> 
>>>>>>> [1] https://lists.apache.org/thread.html/ce99cba4a10b9dc40eb729d39910f315ae41d80ec74f09a356c73938@%3Cdev.flink.apache.org%3E
>>>>>>> [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-74%3A+Flink+JobClient+API
>>>>>> 
>>>>> 
>>>>
