Hi Roman,

Thank you for the proposal. This is a much-needed feature.

One question: for the PUT request, would it make sense to use a distinct
HTTP status code in the response when the request fails due to a
conflicting update? Since the new expected version field is included in
the body, 409 Conflict could be used for a rejection. Alternatively,
412 Precondition Failed could be used if the version is moved to a header.
I believe this would make the API easier to use programmatically, as it
would simplify error handling.

Best regards,
Kaigorodov Andrei

On Mon, May 12, 2025 at 7:34 AM Junrui Lee <jrlee....@gmail.com> wrote:

> Hi Roman
>
> Thanks for driving this feature. +1 for this proposal.
>
> I also agree with the suggestion made by Feifan.
>
> Currently, not all configuration items are job-level configurations [1].
> Even for those that are, not all job-level config options can be updated
> at runtime through the Adaptive Scheduler. For instance, certain config
> options related to job plan compilation, such as
> pipeline.operator-chaining.enabled and nearly all of the table.* settings,
> are not eligible for runtime updates.
>
> From a user perspective, it would be beneficial to clearly describe which
> config options can be dynamically updated, allowing users to take better
> advantage of this feature.
>
> Best,
> Junrui
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-478+Introduce+Config+Option+Scope
>
> On Mon, May 12, 2025 at 11:27, Feifan Wang <zoltar9...@163.com> wrote:
>
> > Thanks Roman for driving this useful improvement, +1 for this proposal.
> >
> > Thanks also to Hangxiang and Rui Fan for the discussion. Regarding
> > question 1, I have some ideas for discussion:
> >
> > To give users stable expectations, I think we should perform
> > configuration checks in a whitelist manner. Ensure that the
> > configurations allowed to be modified through this API can actually
> > take effect.
> >
> > In the initial version, we can provide a very small whitelist, even if
> > it only contains a few configurations that we most want to use and that
> > have been confirmed to be effective. This list can be extended later.
> >
> > --------------
> >
> > Best regards,
> > Feifan Wang
> >
> > ---- Replied Message ----
> > | From | Rui Fan<1996fan...@gmail.com> |
> > | Date | 05/11/2025 16:36 |
> > | To | <dev@flink.apache.org> |
> > | Subject | Re: [DISCUSS] FLIP-530: Dynamic job configuration |
> >
> > Thanks Roman for driving this valuable proposal. It uses the Adaptive
> > Scheduler to greatly reduce the downtime of configuration updates,
> > so +1 for this proposal!
> >
> > Overall LGTM. Thanks to Hangxiang for the questions; I have the
> > same questions as Hangxiang. I'd like to share my thoughts:
> >
> > For question 1 about validation:
> >
> > I think validation is necessary, but both the list of valid
> > configurations and the list of invalid configurations have limitations.
> >
> > For valid configurations: IIUC, almost all job-level configurations are
> > valid after the job is restarted by the adaptive scheduler. This means
> > that lots of new configurations need to be added to the list if we list
> > valid configurations. If other developers miss one, the new
> > configuration will fail validation (even though it works).
> >
> > For invalid configurations: I encountered a problem before where the
> > user added a non-existent Flink configuration, but Flink could not
> > detect it. It may have been caused by a typo. Therefore, even if we list
> > some Flink configurations that do not support dynamic modification, we
> > still cannot guarantee that the configurations outside the list will
> > take effect.
> >
> > Even so, I prefer to do limited validation, for example: not through a
> > list, but by hard-coding a few rules (e.g. table.* doesn't work).
> >
> > For question 2 about configuration change history:
> >
> > Logging configuration change history in the first version is fine.
> >
> > As I understand it, both configuration changes and resource requirements
> > changes can trigger a rescale in the Adaptive Scheduler, so the rescale
> > history can probably include both. If we want to show the configuration
> > change history, it might be more appropriate to put it in FLIP-487 [1]
> > and FLIP-495 [2].
> >
> > For question 3 about working together with other dynamic requests:
> >
> > > Configuration changes are applied immediately; resource requirements
> > > changes are applied with some delay
> >
> > Yes, rescaling after some delay could reduce the rescale frequency to
> > avoid some invalid restarts. So I'm curious why configuration changes
> > don't respect the delay mechanism?
> >
> > Please correct me if anything is wrong, thanks!
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
> > [2]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
> >
> > Best,
> > Rui
> >
> > On Sat, May 10, 2025 at 11:57 AM Roman Khachatryan <ro...@apache.org>
> > wrote:
> >
> > Thanks Hangxiang Yu,
> >
> > Please find the answers below.
> >
> > 1. Yes, we should perform validation before trying to update the
> > configuration. I'd rather validate some specific options that are known
> > not to work, though. Finding and hard-coding all the valid options might
> > be impractical, since they can change, and non-trivial.
> >
> > 2. That would be great, but we'd have to store the history of such
> > updates somewhere. For debugging purposes, logs should suffice, I think.
> >
> > 3. That's a great question! Configuration changes are applied
> > immediately; resource requirements changes are applied with some delay;
> > and both are stored in HA immediately.
> > So a configuration change request also results in restarting and
> > applying any pending resource requirements changes.
> >
> > Regards,
> > Roman
> >
> > On Fri, May 9, 2025, 05:10 Hangxiang Yu <master...@gmail.com> wrote:
> >
> > Hi, Roman.
> >
> > Thanks for the FLIP.
> > +1 for supporting dynamic configuration to reduce manual restarts.
> >
> > I just have the questions below:
> >
> > 1. Do we need a list of working configurations, so that unsupported
> > configurations can be rejected in advance?
> >
> > 2. Could we show the change history in the Web UI, so that more details
> > of each change can be tracked?
> >
> > 3. How does it work together with other dynamic requests? For example,
> > when the parallelism is modified at the same time through
> > '/jobs/:jobid/resource-requirements'.
> >
> > On Fri, May 9, 2025 at 5:00 AM Roman Khachatryan <ro...@apache.org>
> > wrote:
> >
> > Hi everyone,
> >
> > I would like to start a discussion about FLIP-530: Dynamic job
> > configuration [1].
> >
> > In some cases, it is desirable to change a Flink job's configuration
> > after it was submitted to Flink, for example:
> > - Troubleshooting (e.g. increasing the checkpoint timeout or failure
> >   threshold)
> > - Performance optimization (e.g. tuning state backend parameters)
> > - Enabling new features after testing them in a non-production
> >   environment. This allows decoupling the upgrade to a newer Flink
> >   version from actually enabling the features.
> >
> > To support such use cases, we propose to enhance the Flink job
> > configuration REST endpoint with support for reading the full job
> > configuration and updating it.
> >
> > Looking forward to feedback.
> >
> > [1]
> > https://cwiki.apache.org/confluence/x/uglKFQ
> >
> > Regards,
> > Roman
> >
> > --
> > Best,
> > Hangxiang.
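
To illustrate the status code question at the top of this message, here is a rough sketch of what client-side error handling could look like if a conflicting update were reported with its own status code. It is only an assumption for discussion: the FLIP does not define these codes today, and the names UpdateOutcome, classify_update_response and should_refetch_and_retry are hypothetical, not part of Flink or of the proposal.

```python
from enum import Enum


class UpdateOutcome(Enum):
    """Hypothetical client-side view of a configuration PUT result."""

    APPLIED = "applied"
    VERSION_CONFLICT = "version_conflict"
    OTHER_FAILURE = "other_failure"


def classify_update_response(status_code: int) -> UpdateOutcome:
    """Map the HTTP status of a configuration PUT to a client-side outcome.

    Assumption: the server answers a stale expected version with
    409 Conflict (version in the body) or 412 Precondition Failed
    (version in a header). Neither is specified by the FLIP yet.
    """
    if 200 <= status_code < 300:
        return UpdateOutcome.APPLIED
    if status_code in (409, 412):
        return UpdateOutcome.VERSION_CONFLICT
    return UpdateOutcome.OTHER_FAILURE


def should_refetch_and_retry(status_code: int) -> bool:
    """Only a version conflict warrants re-reading the config and retrying."""
    return classify_update_response(status_code) is UpdateOutcome.VERSION_CONFLICT
```

With a distinct code, a client can decide to re-read the current configuration and retry without parsing the error message; any other failure (for example a validation error) would be surfaced to the user as is.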