Hi Matthias! Thank you for the detailed proposal. Overall, I am in favor of making this unification to simplify the logic and make the integration for external components more straightforward. I will try to read through the proposal more carefully next week and provide some detailed feedback.
+1

Thanks
Gyula

On Fri, Sep 8, 2023 at 8:36 AM Matthias Pohl <matthias.p...@aiven.io.invalid> wrote:

> Just a bit more elaboration on the question that we need to answer here: Do
> we want to expose the internal ArchivedExecutionGraph data structure
> through JSON?
>
> - The JSON approach allows the user to have (almost) full access to the
> information (that would otherwise be derived from the REST API). Therefore,
> there's no need to spin up a cluster to access this information.
> Any information that shall be exposed through the REST API needs to be
> well-defined in this JSON structure, though. Large parts of the
> ArchivedExecutionGraph data structure (essentially anything that shall be
> used to populate the REST API) become public domain, which puts
> more constraints on this data structure and makes it harder to change it in
> the future.
>
> - The binary data approach allows us to keep the data structure itself
> internal. We have more control over what we want to expose by providing
> access points in the ClusterClient (e.g. just add a command to extract the
> external storage path from the file).
>
> - The compromise (i.e. keeping ExecutionGraphInfoStore and JobResultStore
> separate and just exposing the checkpoint information next to the JobResult
> in the JobResultStore file) would keep us the closest to the current state.
> It requires the least code changes and the least exposure of internal data
> structures, and it would allow any system (like the Kubernetes Operator) to
> extract the checkpoint's external storage path. But we would still be stuck
> with kind-of redundant components.
>
> From a user's perspective, I feel like the JSON approach is the best one
> because it gives them the most freedom to be independent of Flink
> binaries when handling completed jobs. But I see benefits from a Flink
> developer's perspective in not exposing the entire data structure and
> instead using the ClusterClient as an access point.
> The last option is my least favorite one: Moving the ExecutionGraphInfo out
> of the JobManager seems to be the right thing to do when thinking about
> Flink's vision to become cloud-native.
>
> Just my 2cts on that topic.
> Matthias
>
> On Mon, Sep 4, 2023 at 1:11 PM Matthias Pohl <matthias.p...@aiven.io>
> wrote:
>
> > Hi everyone,
> > I want to open the discussion on FLIP-360 [1]. The goal of this FLIP is
> > to combine the two very similar components ExecutionGraphInfoStore and
> > JobResultStore into a single component.
> >
> > The benefit of this effort would be to expose the metadata of a
> > globally-terminated job even in cases where the JobManager fails shortly
> > after the job finished. This is relevant for external checkpoint
> > management (like it's done in the Kubernetes Operator), which relies on
> > the checkpoint information being available.
> >
> > More generally, it would allow completed jobs to be listed as part of the
> > Flink cluster even after a JM failover. This would allow users to gain
> > more control over finished jobs.
> >
> > The current state of the FLIP doesn't come to a final conclusion on
> > the serialization format of the data (JSON vs. binary). I want to
> > emphasize that there's also a third option, which keeps both components
> > separate and only exposes the additional checkpoint information through
> > the JobResultStore.
> >
> > I'm looking forward to feedback.
> > Best,
> > Matthias
> >
> > PS: I might be less responsive in the next 2-3 weeks but want to initiate
> > the discussion, anyway.
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-360%3A+Merging+the+ExecutionGraphInfoStore+and+the+JobResultStore+into+a+single+component+CompletedJobStore
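
To make the "compromise" option from the thread above a bit more concrete: it would amount to extending the JSON entry that the JobResultStore already writes per job with the checkpoint's external storage path, so an external system could read it without a running cluster. The sketch below is purely illustrative; the field names (in particular `checkpoint-info`, `checkpoint-id`, and `external-path`) are hypothetical and do not reflect Flink's actual JobResultStore file format:

```json
{
  "version": 1,
  "result": {
    "id": "6b2f2d4a9c4e4c9fb1a3c8d7e5f60112",
    "application-status": "SUCCEEDED"
  },
  "checkpoint-info": {
    "checkpoint-id": 42,
    "external-path": "s3://my-bucket/flink-checkpoints/6b2f2d4a/chk-42"
  }
}
```

Under this shape, a tool like the Kubernetes Operator could parse the file and pick up `checkpoint-info.external-path` directly, while everything else in the ExecutionGraphInfo would stay internal to Flink.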