Re: [DISCUSS] FLIP-360: Merging ExecutionGraphInfoStore and JobResultStore into a single component

2023-10-24 Thread Matthias Pohl
Hi Gyula,
thanks for joining the discussion. Yeah, I might have gone a bit too far
with it. I guess I was eager to create separate implementations reflecting
the two different interfaces/purposes at the class level, to allow for more
purpose-centric testing.

I am fine with cutting it back and having the separate interfaces
implemented by the same class, as @Shammon proposed. I updated the FLIP
accordingly.
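The agreed-upon shape — two narrow interfaces backed by one class — could look roughly like the following sketch. All names and method signatures here are made up for illustration; the FLIP's actual API may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Heavyweight view: retrieval may need to touch the disk.
interface CompletedJobStore {
    Optional<String> getExecutionGraphInfo(String jobId);
}

// Lightweight view: small-footprint details kept in memory.
interface JobDetailsStore {
    Optional<String> getJobDetails(String jobId);
}

// A single class backs both interfaces, so each consumer can depend on
// the narrow interface it needs while only one store instance exists.
class DefaultCompletedJobStore implements CompletedJobStore, JobDetailsStore {
    private final Map<String, String> details = new HashMap<>();
    private final Map<String, String> graphs = new HashMap<>();

    void addCompletedJob(String jobId, String jobDetails, String graphInfo) {
        details.put(jobId, jobDetails);
        graphs.put(jobId, graphInfo);
    }

    @Override
    public Optional<String> getExecutionGraphInfo(String jobId) {
        return Optional.ofNullable(graphs.get(jobId));
    }

    @Override
    public Optional<String> getJobDetails(String jobId) {
        return Optional.ofNullable(details.get(jobId));
    }
}

class Demo {
    public static void main(String[] args) {
        DefaultCompletedJobStore store = new DefaultCompletedJobStore();
        store.addCompletedJob("job-1", "FINISHED", "archived-graph");
        JobDetailsStore detailsView = store;  // hand this to one consumer
        CompletedJobStore graphView = store;  // and this to another
        System.out.println(detailsView.getJobDetails("job-1").orElseThrow());
        System.out.println(graphView.getExecutionGraphInfo("job-1").orElseThrow());
    }
}
```

This keeps purpose-centric testing possible (each interface can be mocked independently) without a second stateful component.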

Matthias


Re: [DISCUSS] FLIP-360: Merging ExecutionGraphInfoStore and JobResultStore into a single component

2023-10-23 Thread Gyula Fóra
I am a bit confused by the split into CompletedJobStore / JobDetailsStore.
It seems like JobDetailsStore is simply a view on top of CompletedJobStore:
 - Maybe we should not call it a store? Is it storing anything?
 - Why couldn't the cleanup triggering be the responsibility of the
CompletedJobStore? Wouldn't it be simpler to have the storage/cleanup
related logic in a single place?
 - Ideally, the JobDetailsStore / JobDetailsProvider could be a very thin
interface exposed by the CompletedJobStore.
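The "thin view" idea above could be sketched roughly as follows. The names (`JobDetailsProvider`, `asJobDetailsProvider`) are hypothetical, chosen only to illustrate the suggestion, not taken from the FLIP.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// The "view": a minimal read-only interface, not a stateful store.
interface JobDetailsProvider {
    Collection<String> getAvailableJobDetails();
}

class CompletedJobStore implements JobDetailsProvider {
    private final List<String> completedJobDetails = new ArrayList<>();

    void jobFinished(String details) {
        completedJobDetails.add(details);
        triggerCleanupIfNeeded(); // cleanup stays the store's responsibility
    }

    private void triggerCleanupIfNeeded() {
        // TTL-based eviction would live here, keeping all storage/cleanup
        // logic in a single place, as suggested above.
    }

    @Override
    public Collection<String> getAvailableJobDetails() {
        return Collections.unmodifiableList(completedJobDetails);
    }

    /** The thin view handed to other components instead of the full store. */
    JobDetailsProvider asJobDetailsProvider() {
        return this;
    }
}
```

Consumers that only need to list job details would receive the `JobDetailsProvider` view and never see the mutation or cleanup methods.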

Gyula


Re: [DISCUSS] FLIP-360: Merging ExecutionGraphInfoStore and JobResultStore into a single component

2023-09-29 Thread Matthias Pohl
Thanks for sharing your thoughts, Shammon FY. I kind of see your point.

Initially, I didn't put much thought into splitting up the interface into
two because the dispatcher would have been the only component dealing with
the two interfaces. Having two interfaces wouldn't have added much value
(in terms of testability and readability, I thought).

But I iterated over the idea once more and came up with a new proposal that
involves the two components CompletedJobStore and JobDetailsStore. It's not
100% what you suggested (because the retrieval of the ExecutionGraphInfo
still lives in the CompletedJobStore), but the separation makes sense, in my
opinion:
- The CompletedJobStore deals with the large data that might require disk
access.
- The JobDetailsStore handles the small-footprint data that lives in memory
all the time. Additionally, it takes care of deleting the completed job's
metadata from both stores if a TTL is configured.
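The TTL behaviour described above could look roughly like this sketch. All names are illustrative, and the FLIP may schedule deletions proactively; this sketch instead evicts lazily on access and injects the clock so the behaviour is deterministic to test.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// In-memory store for small-footprint job details with a configurable TTL.
class JobDetailsStore {
    private final Map<String, String> detailsById = new HashMap<>();
    private final Map<String, Long> insertedAtMs = new HashMap<>();
    private final long ttlMs;
    private final LongSupplier clockMs; // injectable for testing

    JobDetailsStore(Duration ttl, LongSupplier clockMs) {
        this.ttlMs = ttl.toMillis();
        this.clockMs = clockMs;
    }

    void put(String jobId, String details) {
        detailsById.put(jobId, details);
        insertedAtMs.put(jobId, clockMs.getAsLong());
    }

    /** Returns the details, first evicting the entry if its TTL has elapsed. */
    String get(String jobId) {
        Long insertedAt = insertedAtMs.get(jobId);
        if (insertedAt != null && clockMs.getAsLong() - insertedAt >= ttlMs) {
            // The real component would also trigger deletion of the job's
            // metadata in the disk-backed CompletedJobStore at this point.
            detailsById.remove(jobId);
            insertedAtMs.remove(jobId);
        }
        return detailsById.get(jobId);
    }
}
```

With a fake clock, an entry is visible before the TTL elapses and gone afterwards, without any sleeping in tests.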

See FLIP-360 [1] with the newly added class and sequence diagrams and
additional content. I only updated the Interfaces & Methods section (see
diff [2]).

I'm looking forward to feedback.

Best,
Matthias

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-360%3A+Merging+the+ExecutionGraphInfoStore+and+the+JobResultStore+into+a+single+component+CompletedJobStore
[2]
https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=263428420&selectedPageVersions=14&selectedPageVersions2=13


Re: [DISCUSS] FLIP-360: Merging ExecutionGraphInfoStore and JobResultStore into a single component

2023-09-17 Thread Shammon FY
Hi Matthias,

Thanks for initiating this discussion, and +1 overall from my side.
It's really strange to have two different places to store completed jobs;
it also adds complexity to Flink. I agree with using a unified instance to
store the completed job information.

In terms of capabilities, `ExecutionGraphInfoStore` and `JobResultStore` are
different: one is mainly used for information display, and the other is used
for failover. So after unifying the storage, can we use different interfaces
to meet the different requirements for jobs? Adding all these methods for
different components into one interface such as `CompletedJobStore` may be
a little strange. What do you think?

Best,
Shammon FY





Re: [DISCUSS] FLIP-360: Merging ExecutionGraphInfoStore and JobResultStore into a single component

2023-09-08 Thread Gyula Fóra
Hi Matthias!

Thank you for the detailed proposal; overall, I am in favor of making this
unification to simplify the logic and make the integration of external
components more straightforward.
I will try to read through the proposal more carefully next week and
provide some detailed feedback.

+1

Thanks
Gyula



Re: [DISCUSS] FLIP-360: Merging ExecutionGraphInfoStore and JobResultStore into a single component

2023-09-08 Thread Matthias Pohl
Just a bit more elaboration on the question that we need to answer here: Do
we want to expose the internal ArchivedExecutionGraph data structure
through JSON?

- The JSON approach allows the user to have (almost) full access to the
information (that would be otherwise derived from the REST API). Therefore,
there's no need to spin up a cluster to access this information.
Any information that shall be exposed through the REST API needs to be
well-defined in this JSON structure, though. Large parts of the
ArchivedExecutionGraph data structure (essentially anything that shall be
used to populate the REST API) become public domain, which puts more
constraints on this data structure and makes it harder to change it in
the future.

- The binary data approach allows us to keep the data structure itself
internal. We have more control over what we want to expose by providing
access points in the ClusterClient (e.g. just add a command to extract the
external storage path from the file).

- The compromise (i.e. keeping ExecutionGraphInfoStore and JobResultStore
separate and just expose the checkpoint information next to the JobResult
in the JobResultStore file) would keep us the closest to the current state,
requires the least code changes and the least exposure of internal data
structures. It would allow any system (like the Kubernetes Operator) to
extract the checkpoint's external storage path. But we would still be stuck
with kind-of redundant components.
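For illustration, the compromise option would amount to a JobResultStore entry that carries the checkpoint pointer alongside the job result, along these lines. All field names and values in this fragment are made up for the sketch and are not the actual JobResultStore schema:

```json
{
  "result": {
    "jobId": "6b2fd1a2c3d4e5f60718293a4b5c6d7e",
    "applicationStatus": "SUCCEEDED",
    "netRuntime": 1234
  },
  "latestCheckpoint": {
    "checkpointId": 42,
    "externalPointer": "s3://my-bucket/checkpoints/chk-42"
  }
}
```

An external system such as the Kubernetes Operator would only need to parse this one small file to recover the checkpoint's external storage path, without the full ArchivedExecutionGraph ever being exposed.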

From a user's perspective, I feel like the JSON approach is the best one
because it gives them the most freedom to be independent of Flink
binaries when handling completed jobs. But I see benefits, from a Flink
developer's perspective, in not exposing the entire data structure and
instead using the ClusterClient as an access point.

The last option is my least favorite one: Moving the ExecutionGraphInfo out
of the JobManager seems to be the right thing to do when thinking about
Flink's vision to become cloud-native.

Just my 2cts on that topic.
Matthias



[DISCUSS] FLIP-360: Merging ExecutionGraphInfoStore and JobResultStore into a single component

2023-09-04 Thread Matthias Pohl
Hi everyone,
I want to open the discussion on FLIP-360 [1]. The goal of this FLIP is to
combine the two very similar components ExecutionGraphInfoStore and
JobResultStore into a single component.

The benefit of this effort would be to expose the metadata of a
globally-terminated job even in cases where the JobManager fails shortly
after the job finished. This is relevant for external checkpoint management
(like it's done in the Kubernetes Operator) which relies on the checkpoint
information to be available.

More generally, it would allow completed jobs to be listed as part of the
Flink cluster even after a JM failover. This would allow users to gain more
control over finished jobs.

The current state of the FLIP doesn't reach a final conclusion on the
serialization format of the data (JSON vs. binary). I want to emphasize
that there's also a third option, which keeps both components separate and
only exposes the additional checkpoint information through the
JobResultStore.

I'm looking forward to feedback.
Best,
Matthias

PS: I might be less responsive in the next 2-3 weeks but want to initiate
the discussion, anyway.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-360%3A+Merging+the+ExecutionGraphInfoStore+and+the+JobResultStore+into+a+single+component+CompletedJobStore

-- 

*Matthias Pohl*
Opensource Software Engineer, *Aiven*
matthias.p...@aiven.io | +49 170 9869525
aiven.io
*Aiven Deutschland GmbH*
Alexanderufer 3-7, 10117 Berlin
Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
Amtsgericht Charlottenburg, HRB 209739 B