Thanks, Roman, for driving this valuable proposal. It uses the Adaptive
Scheduler to greatly reduce the downtime of configuration updates,
so +1 for this proposal!

Overall LGTM. Thanks to Hangxiang for the questions; I have the
same questions as Hangxiang. I'd like to share my thoughts:


For question 1 about validation:

I think validation is necessary, but both the list of valid configurations
and the list of invalid configurations have limitations.

For valid configurations: IIUC, almost all job-level configurations are
valid after the Adaptive Scheduler restarts the job. This means that many
new configurations would need to be added to the list if we list valid
configurations. If a developer forgets to add one, the new configuration
will fail validation (even though it works).

For invalid configurations: I previously encountered a problem where a
user added a non-existent Flink configuration, but Flink could not detect
it. It may have been caused by a typo. Therefore, even if we list some
Flink configurations that do not support dynamic modification, we still
cannot guarantee that configurations outside the list will take effect.

Even so, I prefer to do limited validation, for example: not through a
list, but by hard-coding a few rules (e.g. table.* doesn't work).
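
To make the idea concrete, here is a minimal sketch of such rule-based
validation. The class name, method name, and rule set are hypothetical and
not part of Flink or FLIP-530; only the "table." prefix comes from the
example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical rule-based validator; not an existing Flink class. */
final class DynamicConfigValidator {

    // Assumed rule set: key prefixes known not to take effect after a
    // restart by the Adaptive Scheduler. Only "table." is from this thread.
    private static final List<String> REJECTED_PREFIXES = List.of("table.");

    /** Returns the keys that violate a rule; an empty list means "accepted". */
    static List<String> findRejectedKeys(Map<String, String> update) {
        List<String> rejected = new ArrayList<>();
        for (String key : update.keySet()) {
            for (String prefix : REJECTED_PREFIXES) {
                if (key.startsWith(prefix)) {
                    rejected.add(key);
                    break; // one matching rule is enough to reject the key
                }
            }
        }
        return rejected;
    }
}
```

Note that unknown or misspelled keys still pass this check, which is the
limitation described above for any list of invalid configurations.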


For question 2 about configuration change history:

Logging configuration change history in the first version is fine.

As I understand it, both configuration changes and resource requirements
changes can trigger a rescale in the Adaptive Scheduler, so the rescale
history can probably include both. If we want to show the configuration
change history, it might be more appropriate to put it in FLIP-487 [1]
and FLIP-495 [2].
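
As a rough illustration of a rescale history that covers both causes, a
history entry could carry the trigger alongside the timestamp. All names
below are hypothetical; FLIP-487 and FLIP-495 may model this differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical in-memory rescale history; not the FLIP-495 API. */
final class RescaleHistory {

    /** Assumed set of causes for a rescale. */
    enum Trigger { CONFIGURATION_CHANGE, RESOURCE_REQUIREMENTS_CHANGE }

    static final class Entry {
        final long timestampMillis;
        final Trigger trigger;
        final String description;

        Entry(long timestampMillis, Trigger trigger, String description) {
            this.timestampMillis = timestampMillis;
            this.trigger = trigger;
            this.description = description;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void record(long timestampMillis, Trigger trigger, String description) {
        entries.add(new Entry(timestampMillis, trigger, description));
    }

    /** All recorded rescales, in insertion order. */
    List<Entry> all() {
        return Collections.unmodifiableList(entries);
    }

    /** Only the rescales caused by the given trigger. */
    List<Entry> entriesFor(Trigger trigger) {
        List<Entry> result = new ArrayList<>();
        for (Entry entry : entries) {
            if (entry.trigger == trigger) {
                result.add(entry);
            }
        }
        return result;
    }
}
```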

For question 3 about how it works with other dynamic requests:

> Configuration changes are applied immediately; resource requirements
> changes are applied with some delay

Yes, rescaling after some delay could reduce the rescale frequency and
avoid some unnecessary restarts. So I'm curious why configuration changes
don't respect the delay mechanism?
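
To illustrate why a delay reduces restarts, here is a simplified model that
counts restarts for a sequence of change requests. It is only a sketch under
my own assumptions (one restart per window that opens at the first pending
request), not the Adaptive Scheduler's actual logic; the names are
hypothetical.

```java
import java.util.List;

/** Hypothetical model of coalescing change requests into restarts. */
final class RestartCoalescing {

    /**
     * Counts restarts when all requests arriving within delayMillis after
     * the first pending request are applied by the same restart.
     * requestTimesMillis must be sorted in ascending order.
     */
    static int countRestarts(List<Long> requestTimesMillis, long delayMillis) {
        int restarts = 0;
        long windowEnd = 0L;
        for (long time : requestTimesMillis) {
            // A request outside the current window opens a new window,
            // which ends in one more restart.
            if (restarts == 0 || time > windowEnd) {
                restarts++;
                windowEnd = time + delayMillis;
            }
        }
        return restarts;
    }
}
```

In this model, requests at 0 s, 1 s, 2 s and 60 s cause four restarts when
applied immediately (delay 0), but only two with a 5 s delay.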

Please correct me if anything is wrong, thanks!

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history

Best,
Rui


On Sat, May 10, 2025 at 11:57 AM Roman Khachatryan <ro...@apache.org> wrote:

> Thanks Hangxiang Yu,
>
> Please find the answers below
>
> 1. Yes, we should perform validation before trying to update the
> configuration. I'd rather validate some specific options that are known to
> not work though. Finding and hard-coding all the valid options might be
> impractical and non-trivial, since they can change.
>
> 2. That would be great, but we'd have to store the history of such updates
> somewhere. For debugging purposes, logs should suffice I think
>
> 3. That's a great question! Configuration changes are applied immediately;
> resource requirements changes are applied with some delay; and both are
> stored in HA immediately. So a configuration change request also results
> in restarting and applying any pending resource requirements changes.
>
>
> Regards,
> Roman
>
> On Fri, May 9, 2025, 05:10 Hangxiang Yu <master...@gmail.com> wrote:
>
> > Hi, Roman.
> >
> > Thanks for the FLIP.
> > +1 for supporting dynamic configuration to reduce manual restart.
> >
> >
> > I just have the questions below:
> >
> > 1. Do we need a working configuration list? So some unsupported
> > configurations could be rejected in advance.
> >
> > 2. Could we show the change history in the Web UI? So more changed
> > details could be tracked.
> >
> > 3. How does it work together with other dynamic requests? For example,
> > it modifies the parallelisms together with
> > '/jobs/:jobid/resource-requirements'.
> >
> > On Fri, May 9, 2025 at 5:00 AM Roman Khachatryan <ro...@apache.org>
> > wrote:
> >
> > > Hi everyone,
> > >
> > > I would like to start a discussion about FLIP-530: Dynamic job
> > > configuration [1].
> > >
> > > In some cases, it is desirable to change Flink job configuration
> > > after it was submitted to Flink, for example:
> > > - Troubleshooting (e.g. increase checkpoint timeout or failure
> > >   threshold)
> > > - Performance optimization (e.g. tuning state backend parameters)
> > > - Enabling new features after testing them in a non-Production
> > >   environment. This allows de-coupling upgrading to newer Flink
> > >   versions from actually enabling the features.
> > > To support such use-cases, we propose to enhance the Flink job
> > > configuration REST endpoint with support for reading the full job
> > > configuration and updating it.
> > >
> > > Looking forward to feedback.
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/x/uglKFQ
> > >
> > > Regards,
> > > Roman
> > >
> >
> >
> > --
> > Best,
> > Hangxiang.
> >
>
