Re: [DISCUSS] FLIP-241: Completed Jobs Information Enhancement

Xintong Song Mon, 27 Jun 2022 02:08:43 -0700

Thanks for the updates. LGTM

Best,


Xintong



On Mon, Jun 27, 2022 at 2:48 PM Yangze Guo <karma...@gmail.com> wrote:

> I've updated the FLIP. All of the newly introduced REST APIs will now
> apply to both the JobManager and the HistoryServer.
>
> @Chesnay Schepler @Xintong Song Please take another look at your
> convenience.
>
> Best,
> Yangze Guo
>
>
> On Fri, Jun 24, 2022 at 5:02 PM junhan yang <yangjunhan1...@gmail.com>
> wrote:
> >
> > Distinguishing the APIs through the naming of URLs can be a way to
> > prevent confusion. I think we should reconsider our API design based on
> > the earlier insight and come up with a thorough explanation, or perhaps a
> > better plan.
> >
> > Best regards,
> > Junhan
> >
> > On Fri, Jun 24, 2022 at 4:27 PM Xintong Song <tonysong...@gmail.com> wrote:
> >
> > > I see. So you are suggesting that the jobmanager support both /foo/bar
> > > and /jobs/:jobid/foo/bar, while the history server only supports the
> > > latter.
> > >
> > > I was initially thinking that having two APIs in the jobmanager serving
> > > the exact same purpose is a bit tricky. Now I think it's a good point
> > > that these two APIs, despite returning the same results now, can return
> > > different things in the future.
> > >
> > > Junhan & Yangze, WDYT?
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Fri, Jun 24, 2022 at 3:10 PM Chesnay Schepler <ches...@apache.org>
> > > wrote:
> > >
> > > > > This is pretty simple to explain.
> > > > >
> > > > > "I want to know the environment the job ran in." ->
> > > > > /jobs/:jobid/environment
> > > > > "I want to know the environment the JM ran in." ->
> > > > > /jobmanager/environment
> > > > >
> > > > > It's less about the JobID being a parameter, and more of a way for
> > > > > users to better model the resource they are interested in.
> > > > >
> > > > > In the future we could consider having the job environment endpoint
> > > > > return not just the JM environment, but also those from the CLI/TMs.
> > > >
> > > > On 24/06/2022 06:37, Xintong Song wrote:
> > > > > > Whether the job ID is actually used in the end isn't visible after
> > > > > > all.
> > > > > >
> > > > > > I'm not sure about this. E.g., for an empty session cluster, users
> > > > > > have to understand they don't need to provide an actual jobid for
> > > > > > requesting jobmanager information via rest.
> > > > > >
> > > > > > I believe both ways work. I think this is a trade-off between a)
> > > > > > explaining to history server rest api users how the urls are
> > > > > > different from jobmanager and b) explaining to jobmanager rest api
> > > > > > users why we need an unused jobid for some of the cases. I'm leaning
> > > > > > toward the current approach, because I'd expect a smaller set of
> > > > > > history server rest api users than (or even a subset of) that of
> > > > > > jobmanager.
> > > > >
> > > > > > The plan is to document which urls differ from the jobmanager's, and
> > > > > > how, in the history server page [1].
> > > > >
> > > > > > Compatibility tests should indeed be considered. Thanks for pointing
> > > > > > it out. Currently the compatibility of the history server rest api is
> > > > > > guaranteed by the compatibility of the jobmanager rest api. I think
> > > > > > the only thing we need is to make sure /foo/bar of the jobmanager is
> > > > > > identical to /jobs/:jobid/foo/bar of the history server. We can
> > > > > > introduce an interface, as a subtype of JsonArchivist, that archives
> > > > > > the json with a path that includes the jobid. Then we can test
> > > > > > against all relevant handlers as implementations of this interface.
> > > > >
> > > > > WDYT?
> > > > >
> > > > > Best,
> > > > >
> > > > > Xintong
> > > > >
> > > > >
> > > > > [1]
> > > > >
> > > >
> > >
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/advanced/historyserver/#available-requests
> > > > >
> > > > >
> > > > >
> > > > > > On Thu, Jun 23, 2022 at 5:07 PM Chesnay Schepler <ches...@apache.org> wrote:
> > > > >
> > > > > >> The addition of the /jobs/:jobid/jobmanager/config / environment
> > > > > >> exclusively to the HS is a bit of a strange workaround.
> > > > > >> How do you intend to document those (and test compatibility)?
> > > > > >>
> > > > > >> Why not just add a general /jobs/:jobid/environment endpoint that
> > > > > >> works just like jobmanager/environment?
> > > > > >> To me that seems like a cleaner solution.
> > > > > >> It is somewhat mentioned as an alternative in the FLIP, but I don't
> > > > > >> understand what is supposed to be confusing about it.
> > > > > >> Whether the job ID is actually used in the end isn't visible after
> > > > > >> all.
> > > > > >>
> > > > > >> /jobmanager/config could be integrated into /jobs/:jobid/config.
> > > > > >>
> > > > > >> The same approach could maybe be used for logs; not really sure yet
> > > > > >> (not a fan of displaying logs in the HS in the first place).
> > > > >>
> > > > >> On 23/06/2022 06:55, junhan yang wrote:
> > > > >>> Hi all,
> > > > >>>
> > > > > >>> Thank you all for your feedback. As far as I can see, the
> > > > > >>> discussion on this FLIP has converged.
> > > > >>>
> > > > >>> I will start a new vote thread now.
> > > > >>>
> > > > >>> Best regards,
> > > > >>> Junhan
> > > > >>>
> > > > > >>> On Fri, Jun 17, 2022 at 2:05 PM Yangze Guo <karma...@gmail.com> wrote:
> > > > >>>
> > > > >>>> Thanks for the input, Jiangang.
> > > > >>>>
> > > > > >>>> I think it's a valid demand to distinguish completed jobs with
> > > > > >>>> the same name.
> > > > > >>>> - If they are different jobs, I think users need to give them
> > > > > >>>> different, meaningful names.
> > > > > >>>> - If they are exactly the same job, IIUC, what you need is to
> > > > > >>>> figure out the order. ApplicationId in Yarn might help. But in
> > > > > >>>> this case, you can just sort them by start time.
> > > > >>>>
> > > > >>>> Best,
> > > > >>>> Yangze Guo
> > > > >>>>
> > > > > >>>> On Fri, Jun 17, 2022 at 12:13 PM Jiangang Liu <liujiangangp...@gmail.com> wrote:
> > > > > >>>>> Thanks for the FLIP. It is helpful to track detailed information
> > > > > >>>>> for completed jobs.
> > > > > >>>>> I want to ask another question. In our environment, it is
> > > > > >>>>> sometimes hard to distinguish jobs, since the same job name may
> > > > > >>>>> appear multiple times among the completed jobs, either because a
> > > > > >>>>> job ran multiple times or because different jobs have the same
> > > > > >>>>> name. I wonder whether we can enhance the completed jobs display
> > > > > >>>>> with more information, such as the applicationId and application
> > > > > >>>>> name in Yarn. Identifying a job may be different in k8s.
> > > > >>>>>
> > > > >>>>> Best
> > > > >>>>> Jiangang Liu
> > > > >>>>>
> > > > > >>>>> On Fri, Jun 17, 2022 at 11:40 AM Yangze Guo <karma...@gmail.com> wrote:
> > > > >>>>>
> > > > >>>>>> Thanks for the feedback, Aitozi and Jing.
> > > > >>>>>>
> > > > > >>>>>>> Will every attempt of the TaskManager or JobManager pods (if a
> > > > > >>>>>>> failure occurs) be shown in the UI?
> > > > >>>>>>
> > > > >>>>>> The info of the prior execution attempts will be archived, you
> > > could
> > > > >>>>>> refer to `ArchivedExecutionVertex$priorExecutions`.
> > > > >>>>>>
> > > > > >>>>>>> It seems that most of these metrics are more interesting to
> > > > > >>>>>>> batch jobs. Does it make sense to calculate them for pure
> > > > > >>>>>>> streaming jobs too?
> > > > > >>>>>>
> > > > > >>>>>> All the proposed metrics will be calculated no matter what the
> > > > > >>>>>> job type is.
> > > > > >>>>>>
> > > > > >>>>>>> Why "duration is less interesting" which is mentioned in the
> > > > > >>>>>>> FLIP?
> > > > > >>>>>>
> > > > > >>>>>> As a first step, we mainly focus on the most interesting status
> > > > > >>>>>> during the job lifecycle. The duration of final states like
> > > > > >>>>>> FINISHED and CANCELED is meaningless, while abnormal conditions
> > > > > >>>>>> like CANCELING will not be included at the moment.
> > > > > >>>>>>
> > > > > >>>>>>> Could you share your thoughts on "accumulated-busy-time"? It
> > > > > >>>>>>> should describe the time while the task is working as expected,
> > > > > >>>>>>> i.e. the happy path. When do we need it for analytics or
> > > > > >>>>>>> diagnosis?
> > > > >>>>>>
> > > > > >>>>>> A task could be busy or idle while it is working. Users may
> > > > > >>>>>> adjust the parallelism or the partition key according to the
> > > > > >>>>>> ratio between them.
> > > > >>>>>>
> > > > >>>>>> Best,
> > > > >>>>>> Yangze Guo
> > > > >>>>>>
> > > > > >>>>>> On Fri, Jun 17, 2022 at 5:08 AM Jing Ge <j...@ververica.com> wrote:
> > > > >>>>>>> Hi Junhan
> > > > >>>>>>>
> > > > > >>>>>>> This is must-have information for batch processing. Thanks for
> > > > > >>>>>>> bringing it up.
> > > > >>>>>>>
> > > > >>>>>>> I have some comments:
> > > > >>>>>>>
> > > > > >>>>>>> 1. It seems that most of these metrics are more interesting to
> > > > > >>>>>>> batch jobs. Does it make sense to calculate them for pure
> > > > > >>>>>>> streaming jobs too?
> > > > > >>>>>>> 2. Why "duration is less interesting" which is mentioned in the
> > > > > >>>>>>> FLIP?
> > > > > >>>>>>> 3. Could you share your thoughts on "accumulated-busy-time"? It
> > > > > >>>>>>> should describe the time while the task is working as expected,
> > > > > >>>>>>> i.e. the happy path. When do we need it for analytics or
> > > > > >>>>>>> diagnosis?
> > > > > >>>>>>>
> > > > > >>>>>>> BTW, you might want to optimize the format of the FLIP. Some
> > > > > >>>>>>> text is running out of the right border of the wiki page.
> > > > >>>>>>>
> > > > >>>>>>> Best regards,
> > > > >>>>>>> Jing
> > > > >>>>>>>
> > > > > >>>>>>> On Thu, Jun 16, 2022 at 4:40 PM Aitozi <gjying1...@gmail.com> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>>> Thanks Junhan for driving this. It is a great improvement for
> > > > > >>>>>>>> batch jobs. I'm looking forward to this feature in our internal
> > > > > >>>>>>>> use case. +1 for it.
> > > > > >>>>>>>> One more question:
> > > > > >>>>>>>>
> > > > > >>>>>>>> Will every attempt of the TaskManager or JobManager pods (if a
> > > > > >>>>>>>> failure occurs) be shown in the UI?
> > > > >>>>>>>>
> > > > >>>>>>>> Best,
> > > > >>>>>>>> Aitozi.
> > > > >>>>>>>>
> > > > > >>>>>>>> On Thu, Jun 16, 2022 at 7:10 PM Yang Wang <danrtsey...@gmail.com> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> Thanks Xintong for the explanation.
> > > > >>>>>>>>>
> > > > > >>>>>>>>> It makes sense to leave the discussion about the job result
> > > > > >>>>>>>>> store to a dedicated thread.
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> Best,
> > > > >>>>>>>>> Yang
> > > > >>>>>>>>>
> > > > > >>>>>>>>> On Thu, Jun 16, 2022 at 1:40 PM Xintong Song <tonysong...@gmail.com> wrote:
> > > > >>>>>>>>>
> > > > > >>>>>>>>>> My impression of JobResultStore is more about fault tolerance
> > > > > >>>>>>>>>> and high availability. Using it for providing information to
> > > > > >>>>>>>>>> users sounds worth exploring. We probably need more time to
> > > > > >>>>>>>>>> think it through.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Given that it doesn't conflict with what we have proposed in
> > > > > >>>>>>>>>> this FLIP, I'd suggest considering it as a separate thread and
> > > > > >>>>>>>>>> excluding it from the scope of this one.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Best,
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Xintong
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > > >>>>>>>>>> On Thu, Jun 16, 2022 at 11:43 AM Yang Wang <danrtsey...@gmail.com> wrote:
> > > > > >>>>>>>>>>> This is a very useful feature for both finished streaming and
> > > > > >>>>>>>>>>> batch jobs.
> > > > > >>>>>>>>>>> Besides the WebUI & REST API improvements, I am curious
> > > > > >>>>>>>>>>> whether we could also integrate some critical information
> > > > > >>>>>>>>>>> (e.g. latest checkpoint) into the job result store [1].
> > > > > >>>>>>>>>>> I feel this is also somewhat related to "Completed Jobs
> > > > > >>>>>>>>>>> Information Enhancement".
> > > > > >>>>>>>>>>> And I think the history server is not necessary for all
> > > > > >>>>>>>>>>> scenarios, especially when users only want to check the job
> > > > > >>>>>>>>>>> execution result.
> > > > >>>>>>>>>>> [1].
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>
> > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-194%3A+Introduce+the+JobResultStore
> > > > >>>>>>>>>>> Best,
> > > > >>>>>>>>>>> Yang
> > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> On Wed, Jun 15, 2022 at 3:37 PM Xintong Song <tonysong...@gmail.com> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> Thanks Junhan,
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> +1 for the proposed improvements.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Best,
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Xintong
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> On Wed, Jun 15, 2022 at 3:16 PM Yangze Guo <karma...@gmail.com> wrote:
> > > > >>>>>>>>>>>>> Thanks for driving this, Junhan.
> > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> I think it's a valuable usability improvement for both
> > > > > >>>>>>>>>>>>> streaming and batch users. Looking forward to the community
> > > > > >>>>>>>>>>>>> feedback.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Best,
> > > > >>>>>>>>>>>>> Yangze Guo
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> On Wed, Jun 15, 2022 at 3:10 PM junhan yang <yangjunhan1...@gmail.com> wrote:
> > > > >>>>>>>>>>>>>> Hi all,
> > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> I would like to open a discussion on FLIP-241: Completed
> > > > > >>>>>>>>>>>>>> Jobs Information Enhancement.
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> As far as we can tell, streaming and batch users have
> > > > > >>>>>>>>>>>>>> different interests in probing a job. As Flink grows into a
> > > > > >>>>>>>>>>>>>> unified streaming & batch processor and is adopted by more
> > > > > >>>>>>>>>>>>>> and more batch users, the user experience of inspecting
> > > > > >>>>>>>>>>>>>> completed jobs has become more and more important. After
> > > > > >>>>>>>>>>>>>> doing some market research, we spotted several potential
> > > > > >>>>>>>>>>>>>> improvements.
> > > > > >>>>>>>>>>>>>> We are opening this discussion because the improvements
> > > > > >>>>>>>>>>>>>> involve WebUI & REST API changes, which should be openly
> > > > > >>>>>>>>>>>>>> discussed and voted on as FLIPs.
> > > > > >>>>>>>>>>>>>> You can find more details in the FLIP-241 document [1].
> > > > > >>>>>>>>>>>>>> Looking forward to your feedback.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/x/dRD1D
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Best regards,
> > > > >>>>>>>>>>>>>> Junhan
> > > > >>
> > > > >>
> > > >
> > > >
> > >
>
