Hi Roman,

Thanks for driving this feature. +1 for this proposal.
I also agree with the suggestion made by Feifan. Currently, not all
configuration items are job-level configurations [1]. Even among those that
are, not every job-level config option can be updated at runtime through the
Adaptive Scheduler. For instance, config options related to job plan
compilation, such as pipeline.operator-chaining.enabled and nearly all of the
table.* settings, are not eligible for runtime updates.

From a user perspective, it would be beneficial to clearly describe which
config options can be dynamically updated, allowing users to take better
advantage of this feature.

Best,
Junrui

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-478+Introduce+Config+Option+Scope

Feifan Wang <zoltar9...@163.com> wrote on Mon, May 12, 2025 at 11:27:

> Thanks Roman for driving this useful improvement, +1 for this proposal.
>
> Thanks also to Hangxiang and Rui Fan for the discussion. Regarding
> question 1, I have some ideas for discussion:
>
> To give users stable expectations, I think we should perform configuration
> checks in a whitelist manner, ensuring that the configurations allowed to be
> modified through this API can actually take effect.
>
> In the initial version, we can provide a very small whitelist, even if it
> only contains the few configurations that we most want to use and that have
> been confirmed to be effective. This list can be extended later.
>
> ----------
>
> Best regards,
> Feifan Wang
>
> ---- Replied Message ----
> | From | Rui Fan<1996fan...@gmail.com> |
> | Date | 05/11/2025 16:36 |
> | To | <dev@flink.apache.org> |
> | Subject | Re: [DISCUSS] FLIP-530: Dynamic job configuration |
>
> Thanks Roman for driving this valuable proposal. It uses the Adaptive
> Scheduler to greatly reduce the downtime of configuration updates,
> so +1 for this proposal!
>
> Overall LGTM. Thanks to Hangxiang for the questions; I have the
> same questions as Hangxiang.
> I'd like to share my thoughts:
>
> For question 1 about validation:
>
> I think validation is necessary, but both a list of valid configurations
> and a list of invalid configurations have limitations.
>
> For valid configurations: IIUC, almost all job-level configurations are
> valid after the job is restarted by the Adaptive Scheduler. That means many
> new configurations would need to be added to the list if we list valid
> configurations. If other developers forget to add a new configuration, it
> will fail validation even though it works.
>
> For invalid configurations: I encountered a problem before where a user
> added a non-existent Flink configuration, but Flink could not detect it.
> It may have been caused by a typo. Therefore, even if we list some Flink
> configurations that do not support dynamic modification, we still cannot
> guarantee that configurations outside the list will take effect.
>
> Even so, I prefer to do limited validation, for example: not through a
> list, but by hard-coding a few rules (e.g. table.* doesn't work).
>
> For question 2 about configuration change history:
>
> Logging configuration change history in the first version is fine.
>
> As I understand it, both configuration changes and resource requirements
> changes could trigger a rescale in the Adaptive Scheduler, so the rescale
> history can probably include both. If we want to show the configuration
> change history, it might be more appropriate to put it in FLIP-487 [1] and
> FLIP-495 [2].
>
> For question 3 about working together with other dynamic requests:
>
> > Configuration changes are applied immediately; resource requirements
> > changes are applied with some delay
>
> Yes, rescaling after some delay could reduce the rescale frequency and avoid
> some unnecessary restarts. So I'm curious why configuration changes don't
> respect the delay mechanism?
>
> Please correct me if anything is wrong, thanks!
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
> [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
>
> Best,
> Rui
>
> On Sat, May 10, 2025 at 11:57 AM Roman Khachatryan <ro...@apache.org> wrote:
>
> Thanks Hangxiang Yu,
>
> Please find the answers below.
>
> 1. Yes, we should perform validation before trying to update the
> configuration. I'd rather validate some specific options that are known
> not to work, though. Finding and hard-coding all the valid options might be
> impractical and non-trivial, since they can change.
>
> 2. That would be great, but we'd have to store the history of such updates
> somewhere. For debugging purposes, logs should suffice, I think.
>
> 3. That's a great question! Configuration changes are applied immediately;
> resource requirements changes are applied with some delay; and both are
> stored in HA immediately. So a configuration change request also results in
> restarting and applying any pending resource requirements changes.
>
> Regards,
> Roman
>
> On Fri, May 9, 2025, 05:10 Hangxiang Yu <master...@gmail.com> wrote:
>
> Hi, Roman.
>
> Thanks for the FLIP.
> +1 for supporting dynamic configuration to reduce manual restarts.
>
> I just have the questions below:
>
> 1. Do we need a list of working configurations, so that unsupported
> configurations can be rejected in advance?
>
> 2. Could we show the change history in the Web UI, so that more details of
> the changes can be tracked?
>
> 3. How does it work together with other dynamic requests? For example, when
> it modifies the parallelism together with
> '/jobs/:jobid/resource-requirements'.
>
> On Fri, May 9, 2025 at 5:00 AM Roman Khachatryan <ro...@apache.org> wrote:
>
> Hi everyone,
>
> I would like to start a discussion about FLIP-530: Dynamic job
> configuration [1].
>
> In some cases, it is desirable to change the Flink job configuration after
> the job was submitted to Flink, for example:
> - Troubleshooting (e.g. increasing the checkpoint timeout or failure
>   threshold)
> - Performance optimization (e.g. tuning state backend parameters)
> - Enabling new features after testing them in a non-production
>   environment. This allows decoupling the upgrade to a newer Flink version
>   from actually enabling the features.
>
> To support such use cases, we propose to enhance the Flink job
> configuration REST endpoint with support for reading the full job
> configuration and updating it.
>
> Looking forward to feedback.
>
> [1] https://cwiki.apache.org/confluence/x/uglKFQ
>
> Regards,
> Roman
>
> --
> Best,
> Hangxiang.
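
To make the "limited validation" option from this thread concrete (hard-coding a few rules instead of maintaining a full list), here is a minimal sketch. It is only an illustration under assumptions: the class DynamicConfigValidator and its method findRejectedKeys are hypothetical names and not part of Flink or of FLIP-530. The only rules encoded are the two examples mentioned above (table.* and pipeline.operator-chaining.enabled).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical helper (not a Flink class): flags options in an update request
 * that are known not to take effect when changed at runtime.
 */
final class DynamicConfigValidator {

    // Prefix rule taken from the thread: table.* settings affect plan compilation.
    private static final List<String> REJECTED_PREFIXES = List.of("table.");

    // Exact key named in the thread as not eligible for runtime updates.
    private static final Set<String> REJECTED_KEYS =
            Set.of("pipeline.operator-chaining.enabled");

    private DynamicConfigValidator() {}

    /** Returns the requested keys that violate a hard-coded rule, in the map's iteration order. */
    static List<String> findRejectedKeys(Map<String, String> requestedUpdate) {
        List<String> rejected = new ArrayList<>();
        for (String key : requestedUpdate.keySet()) {
            if (REJECTED_KEYS.contains(key) || matchesRejectedPrefix(key)) {
                rejected.add(key);
            }
        }
        return rejected;
    }

    // True if the key starts with any of the rejected prefixes.
    private static boolean matchesRejectedPrefix(String key) {
        for (String prefix : REJECTED_PREFIXES) {
            if (key.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

A caller could reject the whole update request if findRejectedKeys returns a non-empty list. As noted in the discussion, a rule set like this cannot catch typos or non-existent keys, so it does not guarantee that every accepted option takes effect. The whitelist approach would invert the check and accept only keys that are explicitly listed.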