Thanks to Yangze for driving this proposal!

Overall it looks good to me! This proposal is useful for improving
performance when a job doesn't need failover.

I have some minor questions:

1. How does it work with FLIP-383[1]?

This FLIP introduces high-availability.enable-job-recovery,
and FLIP-383 introduces execution.batch.job-recovery.enabled.

IIUC, when high-availability.enable-job-recovery is false, the job
cannot recover even if execution.batch.job-recovery.enabled = true,
right?

If so, could we validate these options together and log a warning? Or
disable execution.batch.job-recovery.enabled directly when
high-availability.enable-job-recovery = false? A rough sketch of such a
check follows.
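To make option 1 concrete, here is a rough, hypothetical sketch (not actual
Flink code) of such a check. The class name, the defaults, and the place it
would run (e.g., while parsing the cluster configuration) are assumptions on
my side; only the two option keys come from the FLIPs:

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;
import org.apache.flink.configuration.Configuration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class JobRecoveryOptionsCheck {

    private static final Logger LOG =
            LoggerFactory.getLogger(JobRecoveryOptionsCheck.class);

    // Assumed option keys/defaults; the final names may differ in the FLIPs.
    private static final ConfigOption<Boolean> HA_JOB_RECOVERY =
            ConfigOptions.key("high-availability.enable-job-recovery")
                    .booleanType()
                    .defaultValue(true);

    private static final ConfigOption<Boolean> BATCH_JOB_RECOVERY =
            ConfigOptions.key("execution.batch.job-recovery.enabled")
                    .booleanType()
                    .defaultValue(false);

    /** Warn and turn off batch job recovery when HA job recovery is disabled. */
    public static void reconcile(Configuration conf) {
        if (!conf.get(HA_JOB_RECOVERY) && conf.get(BATCH_JOB_RECOVERY)) {
            LOG.warn(
                    "{} is set to true, but it cannot take effect because {} is false; "
                            + "disabling it.",
                    BATCH_JOB_RECOVERY.key(),
                    HA_JOB_RECOVERY.key());
            conf.set(BATCH_JOB_RECOVERY, false);
        }
    }
}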

2. Could we rename it to high-availability.job-recovery.enabled to unify
the naming?

WDYT?

[1] https://cwiki.apache.org/confluence/x/QwqZE

Best,
Rui

On Mon, Jan 8, 2024 at 2:04 PM Yangze Guo <karma...@gmail.com> wrote:

> Thanks for your comment, Yong.
>
> Here are my thoughts on the splitting of HighAvailableServices:
> Firstly, I would treat this separation as a result of technical debt
> and a side effect of the FLIP. In order to achieve a cleaner interface
> hierarchy for High Availability before Flink 2.0, the design decision
> should not be limited to OLAP scenarios.
> I agree that the current HAServices can be divided based on either the
> actual target (cluster & job) or the type of functionality (leader
> election & persistence). From a conceptual perspective, I do not see
> one approach being better than the other. However, I have chosen the
> current separation for a clear separation of concerns. After FLIP-285,
> each process has a dedicated LeaderElectionService responsible for
> leader election of all the components within it. This
> LeaderElectionService has its own lifecycle management. If we were to
> split the HAServices into 'ClusterHighAvailabilityService' and
> 'JobHighAvailabilityService', we would need to couple the lifecycle
> management of these two interfaces, as they both rely on the
> LeaderElectionService and other relevant classes. This coupling and
> implicit design assumption will increase the complexity and testing
> difficulty of the system. WDYT?
>
> Best,
> Yangze Guo
>
> On Mon, Jan 8, 2024 at 12:08 PM Yong Fang <zjur...@gmail.com> wrote:
> >
> > Thanks Yangze for starting this discussion. I have one comment: why do we
> > need to abstract two services as `LeaderServices` and
> > `PersistenceServices`?
> >
> > From the content, the purpose of this FLIP is to make job failover more
> > lightweight, so it would be more appropriate to abstract two services as
> > `ClusterHighAvailabilityService` and `JobHighAvailabilityService` instead
> > of `LeaderServices` and `PersistenceServices` based on leader and store. In
> > this way, we can create a `JobHighAvailabilityService` that has a leader
> > service and store for the job that meets the requirements based on the
> > configuration in the zk/k8s high availability service.
> >
> > WDYT?
> >
> > Best,
> > Fang Yong
> >
> > On Fri, Dec 29, 2023 at 8:10 PM xiangyu feng <xiangyu...@gmail.com> wrote:
> >
> > > Thanks Yangze for restart this discussion.
> > >
> > > +1 for the overall idea. By splitting the HighAvailabilityServices into
> > > LeaderServices and PersistenceServices, we may support configuring
> > > different storage behind them in the future.
> > >
> > > We did run into real problems in production where too much job metadata
> > > was being stored on ZK, causing system instability.
> > >
> > >
> > > Yangze Guo <karma...@gmail.com> 于2023年12月29日周五 10:21写道:
> > >
> > > > Thanks for the response, Zhanghao.
> > > >
> > > > PersistenceServices sounds good to me.
> > > >
> > > > Best,
> > > > Yangze Guo
> > > >
> > > > On Wed, Dec 27, 2023 at 11:30 AM Zhanghao Chen
> > > > <zhanghao.c...@outlook.com> wrote:
> > > > >
> > > > > Thanks for driving this effort, Yangze! The proposal overall LGTM. Other
> > > > > from the throughput enhancement in the OLAP scenario, the separation of
> > > > > leader election/discovery services and the metadata persistence services
> > > > > will also make the HA impl clearer and easier to maintain. Just a minor
> > > > > comment on naming: would it better to rename PersistentServices to
> > > > > PersistenceServices, as usually we put a noun before Services?
> > > > >
> > > > > Best,
> > > > > Zhanghao Chen
> > > > > ________________________________
> > > > > From: Yangze Guo <karma...@gmail.com>
> > > > > Sent: Tuesday, December 19, 2023 17:33
> > > > > To: dev <dev@flink.apache.org>
> > > > > Subject: [DISCUSS] FLIP-403: High Availability Services for OLAP Scenarios
> > > > >
> > > > > Hi, there,
> > > > >
> > > > > We would like to start a discussion thread on "FLIP-403: High
> > > > > Availability Services for OLAP Scenarios"[1].
> > > > >
> > > > > Currently, Flink's high availability service consists of two
> > > > > mechanisms: leader election/retrieval services for JobManager and
> > > > > persistent services for job metadata. However, these mechanisms are
> > > > > set up in an "all or nothing" manner. In OLAP scenarios, we typically
> > > > > only require leader election/retrieval services for JobManager
> > > > > components since jobs usually do not have a restart strategy.
> > > > > Additionally, the persistence of job states can negatively impact the
> > > > > cluster's throughput, especially for short query jobs.
> > > > >
> > > > > To address these issues, this FLIP proposes splitting the
> > > > > HighAvailabilityServices into LeaderServices and PersistentServices,
> > > > > and enable users to independently configure the high availability
> > > > > strategies specifically related to jobs.
> > > > >
> > > > > Please find more details in the FLIP wiki document [1]. Looking
> > > > > forward to your feedback.
> > > > >
> > > > > [1]
> > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-403+High+Availability+Services+for+OLAP+Scenarios
> > > > >
> > > > > Best,
> > > > > Yangze Guo
> > > >
> > >
>
