Given the other breaking changes, I think that it is ok to remove the `RunningJobsRegistry` completely.
Since we allow users to specify a HighAvailabilityServices implementation
when starting Flink via `high-availability: FQDN`, I think we should mark
the interface at least @Experimental.

Cheers,
Till

On Tue, Nov 30, 2021 at 2:29 PM Mika Naylor <m...@autophagy.io> wrote:

> Hi Till,
>
> We thought that breaking interfaces, specifically
> HighAvailabilityServices and RunningJobsRegistry, was acceptable in this
> instance because:
>
> - Neither of these interfaces are marked @Public and so carry no
>   guarantees about being public and stable.
> - As far as we are aware, we currently have no users with custom
>   HighAvailabilityServices implementations.
> - The interface was already broken in 1.14 with the changes to
>   CheckpointRecoveryFactory, and will likely be changed again in 1.15
>   due to further changes in that factory.
>
> Given that, we thought changes to the interface would not be disruptive.
> Perhaps it could be annotated as @Internal - I'm not sure exactly what
> guarantees we try and give for the stability of the
> HighAvailabilityServices interface.
>
> Kind regards,
> Mika
>
> On 26.11.2021 18:28, Till Rohrmann wrote:
> >Thanks for creating this FLIP Matthias, Mika and David.
> >
> >I think the JobResultStore is an important piece for fixing Flink's last
> >high-availability problem (afaik). Once we have this piece in place, users
> >no longer risk to re-execute a successfully completed job.
> >
> >I have one comment concerning breaking interfaces:
> >
> >If we don't want to break interfaces, then we could keep the
> >HighAvailabilityServices.getRunningJobsRegistry() method and add a default
> >implementation for HighAvailabilityServices.getJobResultStore(). We could
> >then deprecate the former method and then remove it in the subsequent
> >release (1.16).
> >
> >Apart from that, +1 for the FLIP.
> >
> >Cheers,
> >Till
> >
> >On Wed, Nov 17, 2021 at 6:05 PM David Morávek <d...@apache.org> wrote:
> >
> >> Hi everyone,
> >>
> >> Matthias, Mika and I want to start a discussion about introduction of a
> >> new Flink component, the *JobResultStore*.
> >>
> >> The main motivation is to address shortcomings of the
> >> *RunningJobsRegistry* and surpass it with the new component. These
> >> shortcomings have been first described in FLINK-11813 [1].
> >>
> >> This change should improve the overall stability of the JobManager's
> >> components and address the race conditions in some of the fail over
> >> scenarios during the job cleanup lifecycle.
> >>
> >> It should also help to ensure that Flink doesn't leave any uncleaned
> >> resources behind.
> >>
> >> We've prepared a FLIP-194 [2], which outlines the design and reasoning
> >> behind this new component.
> >>
> >> [1] https://issues.apache.org/jira/browse/FLINK-11813
> >> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=195726435
> >>
> >> We're looking forward for your feedback ;)
> >>
> >> Best,
> >> Matthias, Mika and David
> >>
> >
> Mika Naylor
> https://autophagy.io
>
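
For reference, the non-breaking path suggested in the quoted message of 26.11.2021 (keep getRunningJobsRegistry(), deprecate it, and give getJobResultStore() a default implementation) could look roughly like the sketch below. The interfaces here are simplified, hypothetical stand-ins and not the real Flink types: the method names on both stores and the String job IDs are invented for illustration, and backing the default with the legacy registry is only one assumed way such a default could behave.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-ins for illustration only; NOT the real Flink interfaces.
final class HaMigrationSketch {

    /** Legacy component, kept for one more release (method names are assumed). */
    interface RunningJobsRegistry {
        void setJobFinished(String jobId);
        boolean isJobFinished(String jobId);
    }

    /** New component intended to replace the registry (method names are assumed). */
    interface JobResultStore {
        void storeResult(String jobId);
        boolean hasResult(String jobId);
    }

    interface HighAvailabilityServices {

        /** Kept so that existing custom implementations still compile. */
        @Deprecated
        RunningJobsRegistry getRunningJobsRegistry();

        /**
         * Default implementation, so existing implementers need no code change.
         * Assumption: the default simply adapts the legacy registry.
         */
        default JobResultStore getJobResultStore() {
            RunningJobsRegistry registry = getRunningJobsRegistry();
            return new JobResultStore() {
                @Override
                public void storeResult(String jobId) {
                    registry.setJobFinished(jobId);
                }

                @Override
                public boolean hasResult(String jobId) {
                    return registry.isJobFinished(jobId);
                }
            };
        }
    }

    /** A pre-existing custom implementation that only knows the old method. */
    static final class LegacyCustomHaServices implements HighAvailabilityServices {
        private final Set<String> finished = new HashSet<>();

        @Override
        public RunningJobsRegistry getRunningJobsRegistry() {
            return new RunningJobsRegistry() {
                @Override
                public void setJobFinished(String jobId) {
                    finished.add(jobId);
                }

                @Override
                public boolean isJobFinished(String jobId) {
                    return finished.contains(jobId);
                }
            };
        }
    }
}
```

In this sketch LegacyCustomHaServices overrides only the deprecated method and still exposes a usable JobResultStore through the default. Removing getRunningJobsRegistry() in a later release would then break only those implementations that ignored the deprecation warning.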