Documenting the supported options is a fair concern, but at the same time also a mountain of work: it would require going through all options, creating well-defined rules for what is a job setting and what isn't, enforcing those rules, and possibly also changing a whole bunch of code to make that remotely consistent.

I would say just documenting a few use-cases, like changing the checkpointing interval, would already be good enough. Changing the checkpointing interval on its own would justify this entire effort; anything else that happens to work without explicit documentation could then just be a bonus for power users.

I'd suggest returning FORBIDDEN if the request contains an option that is not allow-listed for change, and limiting BAD REQUEST to invalid JSON.
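To make the suggested status-code split concrete, here is a minimal sketch. The class name, the allowlist contents, and the numeric codes are illustrative assumptions, not from the FLIP; only the FORBIDDEN-vs-BAD-REQUEST distinction is from this thread:

```java
import java.util.Set;

// Hypothetical sketch: options not on the allowlist yield 403 (FORBIDDEN);
// a malformed JSON payload yields 400 (BAD REQUEST). The allowlist entries
// below are examples mentioned in this thread, not a definitive list.
public class ConfigUpdateValidator {
    private static final Set<String> ALLOW_LIST = Set.of(
            "execution.checkpointing.interval",
            "execution.checkpointing.timeout");

    static int statusFor(String optionKey, boolean jsonValid) {
        if (!jsonValid) {
            return 400; // BAD REQUEST: request body is not valid JSON
        }
        if (!ALLOW_LIST.contains(optionKey)) {
            return 403; // FORBIDDEN: option is not allow-listed for dynamic change
        }
        return 200; // OK: option may be applied
    }

    public static void main(String[] args) {
        System.out.println(statusFor("execution.checkpointing.interval", true));
        System.out.println(statusFor("table.exec.state.ttl", true));
        System.out.println(statusFor("anything", false));
    }
}
```

The point of the split is that a rejected-but-valid option is a policy decision (403), while broken JSON is a client error (400).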

But as-is already +1 from my side.

On 12/05/2025 07:33, Junrui Lee wrote:
Hi Roman

Thanks for driving this feature. +1 for this proposal.

I also agree with the suggestion made by Feifan.

Currently, not all configuration items are job-level configurations [1].
Even for those that are, not all job-level config options can be updated at
runtime through the Adaptive Scheduler. For instance, certain config options
related to job plan compilation, such as pipeline.operator-chaining.enabled
and nearly all of the table.* settings, are not eligible for runtime
updates.

From a user perspective, it would be beneficial to clearly describe which
config options can be dynamically updated, allowing users to take better
advantage of this feature.

Best,
Junrui

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-478+Introduce+Config+Option+Scope

Feifan Wang <zoltar9...@163.com> wrote on Mon, May 12, 2025, 11:27:

Thanks Roman for driving this useful improvement, +1 for this proposal.

Thanks also to Hangxiang and Rui Fan for the discussion. Regarding question 1,
I have some ideas for discussion:

To provide stable expectations for users, I think we should perform
configuration checks in a whitelist manner, ensuring that the configurations
allowed to be modified through this API actually take effect.

In the initial version, we can provide a very small whitelist, even if it only
contains the few configurations that we most want to use and that have been
confirmed to take effect. The list can be continuously supplemented later.


——————————————

Best regards,
Feifan Wang



---- Replied Message ----
| From | Rui Fan<1996fan...@gmail.com> |
| Date | 05/11/2025 16:36 |
| To | <dev@flink.apache.org> |
| Subject | Re: [DISCUSS] FLIP-530: Dynamic job configuration |
Thanks Roman for driving this valuable proposal. It uses the Adaptive
Scheduler to greatly reduce the downtime of configuration updates,
so +1 for this proposal!

Overall LGTM. Thanks to Hangxiang for the questions; I have the same
questions as Hangxiang. I'd like to share my thoughts:


For question 1 about validation:

I think validation is necessary, but both a list of valid configurations and
a list of invalid configurations have limitations.

For valid configurations: IIUC, almost all job-level configurations take
effect after the Adaptive Scheduler restarts the job. This means lots of new
configurations would need to be added to the list if we list valid
configurations. If other developers miss it, a new configuration will fail
validation (even though it actually works).

For invalid configurations: I encountered a problem before where a user added
a non-existent Flink configuration, but Flink could not detect it; it may have
been caused by a typo. Therefore, even if we list some Flink configurations
that do not support dynamic modification, we still cannot guarantee that the
configurations outside the list will take effect.

Even so, I prefer to do limited validation: for example, not through a list,
but by hard-coding a few rules (e.g. table.* doesn't work).
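A minimal sketch of such rule-based validation. The rejected prefixes below are drawn from examples mentioned in this thread (table.*, operator chaining), but the rule set and class name are illustrative assumptions:

```java
import java.util.List;

// Sketch of the "few hard-coded rules" alternative to a full allowlist:
// reject options matching a handful of known-ineffective prefixes and
// accept everything else. The prefixes are examples, not an exhaustive set.
public class RuleBasedValidator {
    private static final List<String> REJECTED_PREFIXES =
            List.of("table.", "pipeline.operator-chaining");

    static boolean isDynamicallyUpdatable(String key) {
        // key::startsWith is applied to each rejected prefix in turn
        return REJECTED_PREFIXES.stream().noneMatch(key::startsWith);
    }

    public static void main(String[] args) {
        System.out.println(isDynamicallyUpdatable("execution.checkpointing.interval"));
        System.out.println(isDynamicallyUpdatable("table.exec.state.ttl"));
    }
}
```

The trade-off versus an allowlist is the one Rui describes: unknown or misspelled options pass through, but newly added job-level options work without anyone maintaining a list.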


For question 2 about configuration change history:

Logging configuration change history in the first version is fine.

As I understand it, both configuration changes and resource requirements
changes can trigger a rescale in the Adaptive Scheduler, so the rescale
history can probably include both. If we want to show the configuration
change history, it might be more appropriate to put it in FLIP-487 [1] and
FLIP-495 [2].

For question 3 about co-working with other dynamic requests:

Configuration changes are applied immediately; resource requirements
changes are applied with some delay

Yes, rescaling after some delay can reduce the rescale frequency and avoid
some unnecessary restarts. So I'm curious why configuration changes don't
respect the delay mechanism?

Please correct me if anything is wrong, thanks!

[1]

https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
[2]

https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history

Best,
Rui


On Sat, May 10, 2025 at 11:57 AM Roman Khachatryan <ro...@apache.org>
wrote:

Thanks Hangxiang Yu,

Please find the answers below

1. Yes, we should perform validation before trying to update the
configuration. I'd rather validate some specific options that are known not
to work, though. Finding and hard-coding all the valid options might be
impractical, since they can change, and non-trivial.

2. That would be great, but we'd have to store the history of such updates
somewhere. For debugging purposes, logs should suffice, I think.

3. That's a great question! Configuration changes are applied immediately;
resource requirements changes are applied with some delay; and both are
stored in HA immediately. So a configuration change request also results in
restarting the job and applying any pending resource requirements changes.


Regards,
Roman

On Fri, May 9, 2025, 05:10 Hangxiang Yu <master...@gmail.com> wrote:

Hi, Roman.

Thanks for the FLIP.
+1 for supporting dynamic configuration to reduce manual restart.


I just have a few questions:

1. Do we need a working configuration list? So that unsupported
configurations could be rejected in advance.

2. Could we show the change history in the Web UI? So that more change
details could be tracked.

3. How does it co-work with other dynamic requests? For example, if it
modifies the parallelisms together with
'/jobs/:jobid/resource-requirements'.

On Fri, May 9, 2025 at 5:00 AM Roman Khachatryan <ro...@apache.org>
wrote:

Hi everyone,

I would like to start a discussion about FLIP-530: Dynamic job
configuration [1].

In some cases, it is desirable to change a Flink job's configuration after it
was submitted to Flink, for example:
- Troubleshooting (e.g. increasing the checkpoint timeout or failure
threshold)
- Performance optimization (e.g. tuning state backend parameters)
- Enabling new features after testing them in a non-production environment.
This allows de-coupling the upgrade to newer Flink versions from actually
enabling the features.
To support such use-cases, we propose to enhance the Flink job configuration
REST endpoint with support for reading the full job configuration and
updating it.
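To make the proposal concrete, an exchange with such an endpoint might look like the following. The exact path, HTTP verb, and payload shape are not specified in this thread, so all of them are hypothetical placeholders, not the FLIP's actual API:

```
PATCH /jobs/:jobid/config          (hypothetical path and verb)
Content-Type: application/json

{ "execution.checkpointing.interval": "30s" }
```

A GET on the same hypothetical path would return the full effective job configuration.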

Looking forward to feedback.

[1]
https://cwiki.apache.org/confluence/x/uglKFQ

Regards,
Roman



--
Best,
Hangxiang.



