Re: [DISCUSS] FLIP-194: Introduce the JobResultStore

Till Rohrmann Tue, 30 Nov 2021 08:28:09 -0800

Given the other breaking changes, I think that it is ok to remove the
`RunningJobsRegistry` completely.


Since we allow users to specify a HighAvailabilityServices implementation
when starting Flink via `high-availability: FQDN`, I think we should mark
the interface at least @Experimental.

Cheers,
Till

On Tue, Nov 30, 2021 at 2:29 PM Mika Naylor <m...@autophagy.io> wrote:

> Hi Till,
>
> We thought that breaking interfaces, specifically
> HighAvailabilityServices and RunningJobsRegistry, was acceptable in this
> instance because:
>
> - Neither of these interfaces is marked @Public, so neither carries any
>    guarantees about being public and stable.
> - As far as we are aware, we currently have no users with custom
>    HighAvailabilityServices implementations.
> - The interface was already broken in 1.14 with the changes to
>    CheckpointRecoveryFactory, and will likely be changed again in 1.15
>    due to further changes in that factory.
>
> Given that, we thought changes to the interface would not be disruptive.
> Perhaps it could be annotated as @Internal - I'm not sure exactly what
> guarantees we try to give for the stability of the
> HighAvailabilityServices interface.
>
> Kind regards,
> Mika
>
> On 26.11.2021 18:28, Till Rohrmann wrote:
> >Thanks for creating this FLIP Matthias, Mika and David.
> >
> >I think the JobResultStore is an important piece for fixing Flink's last
> >high-availability problem (afaik). Once we have this piece in place, users
> >no longer risk re-executing a successfully completed job.
> >
> >I have one comment concerning breaking interfaces:
> >
> >If we don't want to break interfaces, then we could keep the
> >HighAvailabilityServices.getRunningJobsRegistry() method and add a default
> >implementation for HighAvailabilityServices.getJobResultStore(). We could
> >then deprecate the former method and then remove it in the subsequent
> >release (1.16).
> >
> >Apart from that, +1 for the FLIP.
> >
> >Cheers,
> >Till
> >
> >On Wed, Nov 17, 2021 at 6:05 PM David Morávek <d...@apache.org> wrote:
> >
> >> Hi everyone,
> >>
> >> Matthias, Mika and I want to start a discussion about the introduction of
> >> a new Flink component, the *JobResultStore*.
> >>
> >> The main motivation is to address shortcomings of the
> >> *RunningJobsRegistry* and replace it with the new component. These
> >> shortcomings were first described in FLINK-11813 [1].
> >>
> >> This change should improve the overall stability of the JobManager's
> >> components and address the race conditions in some of the failover
> >> scenarios during the job cleanup lifecycle.
> >>
> >> It should also help to ensure that Flink doesn't leave any uncleaned
> >> resources behind.
> >>
> >> We've prepared a FLIP-194 [2], which outlines the design and reasoning
> >> behind this new component.
> >>
> >> [1] https://issues.apache.org/jira/browse/FLINK-11813
> >> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=195726435
> >>
> >> We're looking forward to your feedback ;)
> >>
> >> Best,
> >> Matthias, Mika and David
> >>
>
> Mika Naylor
> https://autophagy.io
>
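
For reference, the non-breaking migration path suggested earlier in the
thread (keep getRunningJobsRegistry(), add a default implementation for
getJobResultStore(), deprecate the former) could look roughly like the
sketch below. It is only an illustration: the interfaces are simplified,
hypothetical stand-ins and not Flink's real signatures, and the method
names recordResult, hasResult, setJobFinished and isJobFinished as well as
the LegacyHaServices class are assumptions made for the example. Backing
the default with the legacy registry is likewise just one possible choice.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-in for the old component (not Flink's real API).
interface RunningJobsRegistry {
    void setJobFinished(String jobId);
    boolean isJobFinished(String jobId);
}

// Hypothetical, simplified stand-in for the new component (not Flink's real API).
interface JobResultStore {
    void recordResult(String jobId);
    boolean hasResult(String jobId);
}

interface HighAvailabilityServices {
    // Kept for one more release so that existing implementations still compile.
    @Deprecated
    RunningJobsRegistry getRunningJobsRegistry();

    // New accessor with a default implementation: existing implementations
    // need no changes. In this sketch the default adapts the legacy registry.
    default JobResultStore getJobResultStore() {
        RunningJobsRegistry legacy = getRunningJobsRegistry();
        return new JobResultStore() {
            @Override
            public void recordResult(String jobId) {
                legacy.setJobFinished(jobId);
            }

            @Override
            public boolean hasResult(String jobId) {
                return legacy.isJobFinished(jobId);
            }
        };
    }
}

// A pre-existing custom implementation that only knows about the old method.
final class LegacyHaServices implements HighAvailabilityServices {
    private final Set<String> finished = ConcurrentHashMap.newKeySet();

    private final RunningJobsRegistry registry = new RunningJobsRegistry() {
        @Override
        public void setJobFinished(String jobId) {
            finished.add(jobId);
        }

        @Override
        public boolean isJobFinished(String jobId) {
            return finished.contains(jobId);
        }
    };

    @Override
    public RunningJobsRegistry getRunningJobsRegistry() {
        return registry;
    }
}
```

With this shape, LegacyHaServices compiles unchanged and still hands out a
usable JobResultStore. The trade-off is that the old method has to stay on
the interface until the deprecation period ends.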
