Re: [DISCUSS] FLIP-241: Completed Jobs Information Enhancement

Xintong Song Thu, 23 Jun 2022 21:38:11 -0700

> Whether the job ID is actually used in the end isn't visible after all.

I'm not sure about this. E.g., for an empty session cluster, users would
have to understand that they don't need to provide an actual jobid when
requesting jobmanager information via REST.


I believe both ways work. I think this is a trade-off between a) explaining
to history server REST API users how the URLs differ from the jobmanager's
and b) explaining to jobmanager REST API users why we need an unused jobid
in some cases. I'm leaning toward the current approach, because I'd expect
the set of history server REST API users to be smaller than (or even a
subset of) that of the jobmanager.

The plan is to document which URLs differ from the jobmanager's, and how,
on the history server page [1].

Compatibility testing should indeed be considered. Thanks for pointing it
out. Currently the compatibility of the history server REST API is
guaranteed by the compatibility of the jobmanager REST API. I think the only
thing we need is to make sure that the response of /foo/bar on the jobmanager
is identical to that of /jobs/:jobid/foo/bar on the history server. We can
introduce an interface, as a subtype of JsonArchivist, that archives the
JSON with a path that includes the jobid. Then we can test against all
relevant handlers as implementations of this interface.
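
To make the idea concrete, here is a rough sketch of such a check, written
in Python for brevity rather than in Flink's Java. Every name in it
(Archived, JobScopedArchivist, the two handler classes, the sample JSON) is
a hypothetical stand-in and not existing Flink API; only the URL shapes come
from this thread.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Archived:
    """Stand-in for an archived REST response: a URL path plus its JSON body."""
    path: str
    json: str


class JobScopedArchivist(ABC):
    """Hypothetical interface: a handler whose cluster-level response is
    archived under a job-scoped path for the history server."""

    @abstractmethod
    def cluster_path(self) -> str:
        """Path served by the jobmanager, e.g. /jobmanager/config."""

    @abstractmethod
    def cluster_json(self) -> str:
        """JSON body the jobmanager returns for cluster_path()."""

    def archive(self, job_id: str) -> Archived:
        # The history server copy lives under /jobs/<jobid> + the original path.
        return Archived(f"/jobs/{job_id}{self.cluster_path()}", self.cluster_json())


class ConfigHandler(JobScopedArchivist):
    def cluster_path(self) -> str:
        return "/jobmanager/config"

    def cluster_json(self) -> str:
        # Made-up payload, for illustration only.
        return '[{"key": "example.option", "value": "1"}]'


class EnvironmentHandler(JobScopedArchivist):
    def cluster_path(self) -> str:
        return "/jobmanager/environment"

    def cluster_json(self) -> str:
        # Made-up payload, for illustration only.
        return '{"jvm": {"version": "example"}}'


def check_compatibility(handler: JobScopedArchivist, job_id: str) -> Archived:
    """One shared check that every implementation of the interface must pass."""
    archived = handler.archive(job_id)
    assert archived.path == f"/jobs/{job_id}{handler.cluster_path()}"
    assert archived.json == handler.cluster_json()
    return archived


# Run the same check against all relevant handlers.
for handler in (ConfigHandler(), EnvironmentHandler()):
    check_compatibility(handler, "abc123")
```

The point is only that a single test can iterate over all implementations
of the interface, so a newly added handler is covered without a new test.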

WDYT?

Best,

Xintong


[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/advanced/historyserver/#available-requests



On Thu, Jun 23, 2022 at 5:07 PM Chesnay Schepler <[email protected]> wrote:

> Adding /jobs/:jobid/jobmanager/config and /jobs/:jobid/jobmanager/environment
> exclusively to the HS is a bit of a strange workaround.
> How do you intend to document those (and test compatibility)?
>
> Why not just add a general /jobs/:jobid/environment endpoint that works
> just like jobmanager/environment?
> To me that seems like a cleaner solution.
> It is somewhat mentioned as an alternative in the FLIP, but I don't
> understand what is supposed to be confusing about it.
> Whether the job ID is actually used in the end isn't visible after all.
>
> /jobmanager/config could be integrated into /jobs/:jobid/config.
>
> The same approach could maybe be used for logs; not really sure yet (not
> a fan of displaying logs in the HS in the first place).
>
> On 23/06/2022 06:55, junhan yang wrote:
> > Hi all,
> >
> > Thank you all for your feedback. As far as I can see, the discussion on
> > this FLIP has converged.
> >
> > I will start a new vote thread now.
> >
> > Best regards,
> > Junhan
> >
> > On Fri, Jun 17, 2022 at 2:05 PM Yangze Guo <[email protected]> wrote:
> >
> >> Thanks for the input, Jiangang.
> >>
> >> I think it's a valid demand to distinguish completed jobs with the same
> >> name.
> >> - If they are different jobs, I think users need to give them
> >> different meaningful names respectively.
> >> - If they are exactly the same job, IIUC, what you need is to figure
> >> out the order. The ApplicationId in YARN might help. But in this case,
> >> you can just sort them by start time.
> >>
> >> Best,
> >> Yangze Guo
> >>
> >> On Fri, Jun 17, 2022 at 12:13 PM Jiangang Liu <[email protected]> wrote:
> >>> Thanks for the FLIP. It is helpful to track detailed information for
> >>> completed jobs.
> >>> I want to ask another question. In our environment, it is sometimes hard
> >>> to distinguish jobs, since the same job name may appear multiple times
> >>> among the completed jobs, either because a job ran multiple times or
> >>> because different jobs share a name. I wonder whether we can enhance the
> >>> completed jobs display with more information, such as the applicationId
> >>> and application name in YARN. Identifying a job may work differently in
> >>> k8s.
> >>>
> >>> Best
> >>> Jiangang Liu
> >>>
> >>> On Fri, Jun 17, 2022 at 11:40 AM Yangze Guo <[email protected]> wrote:
> >>>
> >>>> Thanks for the feedback, Aitozi and Jing.
> >>>>
> >>>>> Will every attempt of the TaskManager or JobManager pods (if a failure
> >>>>> occurs) be shown in the UI?
> >>>>
> >>>> The info of the prior execution attempts will be archived, you could
> >>>> refer to `ArchivedExecutionVertex$priorExecutions`.
> >>>>
> >>>>> It seems that most of these metrics are more interesting to batch
> >>>>> jobs. Does it make sense to calculate them for pure streaming jobs too?
> >>>>
> >>>> All the proposed metrics will be calculated regardless of the job type.
> >>>>
> >>>>> Why "duration is less interesting" which is mentioned in the FLIP?
> >>>>
> >>>> As a first step, we mainly focus on the most interesting statuses during
> >>>> the job lifecycle. The duration of final states like FINISHED and
> >>>> CANCELED is meaningless, while abnormal conditions like CANCELING will
> >>>> not be included at the moment.
> >>>>
> >>>>> Could you share your thoughts on "accumulated-busy-time"? It should
> >>>>> describe the time while the task is working as expected, i.e. the happy
> >>>>> path. When do we need it for analytics or diagnosis?
> >>>>
> >>>> A task could be busy or idle while it is working. Users may adjust the
> >>>> parallelism or the partition key according to the ratio between them.
> >>>>
> >>>> Best,
> >>>> Yangze Guo
> >>>>
> >>>> On Fri, Jun 17, 2022 at 5:08 AM Jing Ge <[email protected]> wrote:
> >>>>> Hi Junhan
> >>>>>
> >>>>> This is must-have information for batch processing. Thanks for
> >>>>> bringing it up.
> >>>>>
> >>>>> I have some comments:
> >>>>>
> >>>>> 1. It seems that most of these metrics are more interesting to batch
> >>>>> jobs. Does it make sense to calculate them for pure streaming jobs too?
> >>>>> 2. Why "duration is less interesting" which is mentioned in the FLIP?
> >>>>> 3. Could you share your thoughts on "accumulated-busy-time"? It should
> >>>>> describe the time while the task is working as expected, i.e. the happy
> >>>>> path. When do we need it for analytics or diagnosis?
> >>>>>
> >>>>> BTW, you might want to optimize the format of the FLIP. Some text is
> >>>>> running out of the right border of the wiki page.
> >>>>>
> >>>>> Best regards,
> >>>>> Jing
> >>>>>
> >>>>> On Thu, Jun 16, 2022 at 4:40 PM Aitozi <[email protected]> wrote:
> >>>>>
> >>>>>> Thanks Junhan for driving this. It is a great improvement for batch
> >>>>>> jobs. I'm looking forward to this feature in our internal use case.
> >>>>>> +1 for it.
> >>>>>> One more question:
> >>>>>>
> >>>>>> Will every attempt of the TaskManager or JobManager pods (if a failure
> >>>>>> occurs) be shown in the UI?
> >>>>>>
> >>>>>> Best,
> >>>>>> Aitozi.
> >>>>>>
> >>>>>> On Thu, Jun 16, 2022 at 7:10 PM Yang Wang <[email protected]> wrote:
> >>>>>>
> >>>>>>> Thanks Xintong for the explanation.
> >>>>>>>
> >>>>>>> It makes sense to leave the discussion about the job result store to
> >>>>>>> a dedicated thread.
> >>>>>>>
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Yang
> >>>>>>>
> >>>>>>> On Thu, Jun 16, 2022 at 1:40 PM Xintong Song <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> My impression of JobResultStore is more about fault tolerance and
> >>>>>>>> high availability. Using it for providing information to users sounds
> >>>>>>>> worth exploring. We probably need more time to think it through.
> >>>>>>>>
> >>>>>>>> Given that it doesn't conflict with what we have proposed in this
> >>>>>>>> FLIP, I'd suggest considering it in a separate thread and excluding it
> >>>>>>>> from the scope of this one.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>>
> >>>>>>>> Xintong
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Jun 16, 2022 at 11:43 AM Yang Wang <[email protected]> wrote:
> >>>>>>>>> This is a very useful feature for both finished streaming and batch
> >>>>>>>>> jobs.
> >>>>>>>>> Besides the WebUI & REST API improvements, I am curious whether we
> >>>>>>>>> could also integrate some critical information (e.g. the latest
> >>>>>>>>> checkpoint) into the job result store [1].
> >>>>>>>>> I feel this is also somewhat related to "Completed Jobs Information
> >>>>>>>>> Enhancement".
> >>>>>>>>> And I think the history server is not necessary for all scenarios,
> >>>>>>>>> especially when users only want to check the job execution result.
> >>>>>>>>>
> >>>>>>>>> [1]
> >>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-194%3A+Introduce+the+JobResultStore
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Yang
> >>>>>>>>>
> >>>>>>>>> On Wed, Jun 15, 2022 at 3:37 PM Xintong Song <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Thanks Junhan,
> >>>>>>>>>>
> >>>>>>>>>> +1 for the proposed improvements.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>>
> >>>>>>>>>> Xintong
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Jun 15, 2022 at 3:16 PM Yangze Guo <[email protected]> wrote:
> >>>>>>>>>>> Thanks for driving this, Junhan.
> >>>>>>>>>>>
> >>>>>>>>>>> I think it's a valuable usability improvement for both streaming
> >>>>>>>>>>> and batch users. Looking forward to the community feedback.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Yangze Guo
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jun 15, 2022 at 3:10 PM junhan yang <[email protected]> wrote:
> >>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I would like to open a discussion on FLIP-241: Completed Jobs
> >>>>>>>>>>>> Information Enhancement.
> >>>>>>>>>>>>
> >>>>>>>>>>>> As far as we can tell, streaming and batch users have different
> >>>>>>>>>>>> interests in probing a job. As Flink grows into a unified streaming &
> >>>>>>>>>>>> batch processor and is adopted by more and more batch users, the user
> >>>>>>>>>>>> experience of inspecting completed jobs has become more and more
> >>>>>>>>>>>> important. After doing some market research, we spotted several
> >>>>>>>>>>>> potential improvements.
> >>>>>>>>>>>> The main reason for this thread is that the proposal involves WebUI &
> >>>>>>>>>>>> REST API changes, which should be openly discussed and voted on as
> >>>>>>>>>>>> FLIPs.
> >>>>>>>>>>>> You can find more details in the FLIP-241 document [1]. Looking
> >>>>>>>>>>>> forward to your feedback.
> >>>>>>>>>>>>
> >>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/x/dRD1D
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best regards,
> >>>>>>>>>>>> Junhan
>
>
>
