Thanks Chesnay, I like your idea of returning 403 for non-allow-listed options; I've updated the FLIP accordingly. I've also specified 'execution.checkpointing.interval' as a default value for the allow-list.
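To make the agreed behaviour concrete, here is a minimal sketch of the allow-list check being discussed (the function and constant names are illustrative, not from the FLIP; only 'execution.checkpointing.interval' as the default allow-list entry and the 403/400 split come from this thread):

```python
# Hypothetical sketch of the allow-list validation for the config-update
# endpoint: 403 (Forbidden) for options outside the allow-list,
# 400 (Bad Request) for malformed input, 200 otherwise.

ALLOW_LIST = {"execution.checkpointing.interval"}  # proposed default

def validate_config_update(update):
    if not isinstance(update, dict):
        return 400, "request body must be a JSON object"
    forbidden = sorted(k for k in update if k not in ALLOW_LIST)
    if forbidden:
        return 403, f"options not allow-listed: {forbidden}"
    return 200, "ok"
```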
Kartikey Pant, that's a good question, and your understanding is correct. There's a possibility of breaking the job via this API after passing the validation. For example, a checkpoint timeout of 1 second would be valid, but might cause the checkpoints to fail. In such a case, the configuration change should be reverted via a new PUT request.

Regards,
Roman

On Thu, May 15, 2025 at 3:45 PM Chesnay Schepler <ches...@apache.org> wrote:
> Documenting the supported options is a fair concern, but at the same time also a mountain of work, as it would require going through all options, creating well-defined rules for what is a job setting and what isn't, enforcing that, and possibly also changing a whole bunch of code to make that remotely consistent.
>
> I would say just documenting a few use cases, like changing the checkpoint interval for example, would already be good enough. Changing the checkpointing interval on its own would justify this entire effort; anything else that happens to work without explicit documentation could then just be a bonus for power users.
>
> I'd suggest returning FORBIDDEN if an option is provided in the request that isn't allow-listed, and limiting BAD REQUEST to invalid JSON.
>
> But as-is, already +1 from my side.
>
> On 12/05/2025 07:33, Junrui Lee wrote:
> > Hi Roman,
> >
> > Thanks for driving this feature. +1 for this proposal.
> >
> > I also agree with the suggestion made by Feifan.
> >
> > Currently, not all configuration items are job-level configurations [1]. Even for those that are, not all job-level config options can be updated at runtime through the Adaptive Scheduler. For instance, certain config options related to job plan compilation, such as pipeline.operator-chaining.enabled and nearly all of the table.* settings, are not eligible for runtime updates.
> > From a user perspective, it would be beneficial to clearly describe which config options can be dynamically updated, allowing users to take better advantage of this feature.
> >
> > Best,
> > Junrui
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-478+Introduce+Config+Option+Scope
> >
> > Feifan Wang <zoltar9...@163.com> wrote on Mon, May 12, 2025 at 11:27:
> >
> >> Thanks Roman for driving this useful improvement, +1 for this proposal.
> >>
> >> Also thanks for the discussion from Hangxiang and Rui Fan. Regarding question 1, I have some ideas for discussion:
> >>
> >> Based on the consideration of providing stable expectations for users, I think we should perform configuration checks in a whitelist manner, ensuring that the configurations allowed to be modified through this API can actually take effect.
> >>
> >> In the initial version, we can provide a very small whitelist, even if it only contains a few configurations that we most want to use and that have been confirmed to be effective. This list can be continuously supplemented later.
> >>
> >> ——————————————
> >>
> >> Best regards,
> >> Feifan Wang
> >>
> >> ---- Replied Message ----
> >> | From | Rui Fan <1996fan...@gmail.com> |
> >> | Date | 05/11/2025 16:36 |
> >> | To | <dev@flink.apache.org> |
> >> | Subject | Re: [DISCUSS] FLIP-530: Dynamic job configuration |
> >>
> >> Thanks Roman for driving this valuable proposal; it uses the Adaptive Scheduler to greatly reduce the downtime of configuration updates, so +1 for this proposal!
> >>
> >> Overall LGTM. Thanks to Hangxiang for the questions; I have the same questions as Hangxiang. I'd like to share my thoughts:
> >>
> >> For question 1 about validation:
> >>
> >> I think validation is necessary, but both the list of valid configurations and the list of invalid configurations have limitations.
> >> For valid configurations: IIUC, almost all job-level configurations are valid after restarting the job via the adaptive scheduler. It means lots of new configurations would need to be added to the list if we list the valid configurations. If other developers miss it, a new configuration will fail validation (even though it works).
> >>
> >> For invalid configurations: I encountered a problem before where a user added a non-existent Flink configuration, but Flink could not detect it. It may have been caused by a typo. Therefore, even if we list some Flink configurations that do not support dynamic modification, we still cannot guarantee that the configurations outside the list will take effect.
> >>
> >> Even so, I prefer to do limited validation, for example: not through a list, but by hard-coding a few rules (e.g. table.* doesn't work).
> >>
> >> For question 2 about configuration change history:
> >>
> >> Logging the configuration change history in the first version is fine.
> >>
> >> As I understand it, both configuration changes and resource-requirement changes can trigger a rescale for the Adaptive Scheduler, so the rescale history can probably include both. If we want to show the configuration change history, it might be more appropriate to put it in FLIP-487 [1] and FLIP-495 [2].
> >>
> >> For question 3 about co-working with other dynamic requests:
> >>
> >> > Configuration changes are applied immediately; resource requirements changes are applied with some delay
> >>
> >> Yes, rescaling after some delay could reduce the rescale frequency to avoid some invalid restarts. So I'm curious why configuration changes don't respect the delay mechanism?
> >>
> >> Please correct me if anything is wrong, thanks!
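Rui Fan's preference for hard-coded rules over a maintained list could be sketched roughly as follows (a minimal sketch; only the table.* rule and pipeline.operator-chaining.enabled come from this thread, and the function name is hypothetical):

```python
# Sketch of rule-based (deny-list) validation: instead of enumerating
# every valid option, reject the few option families known from the
# thread not to take effect at runtime; accept everything else
# optimistically.

DENIED_PREFIXES = ("table.",)
DENIED_OPTIONS = {"pipeline.operator-chaining.enabled"}

def is_dynamically_updatable(key: str) -> bool:
    if key in DENIED_OPTIONS:
        return False
    return not key.startswith(DENIED_PREFIXES)
```

This trades completeness for maintainability: a typo'd or unknown option still passes, which matches the limitation Rui Fan describes.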
> >> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
> >> [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
> >>
> >> Best,
> >> Rui
> >>
> >> On Sat, May 10, 2025 at 11:57 AM Roman Khachatryan <ro...@apache.org> wrote:
> >>
> >> Thanks Hangxiang Yu,
> >>
> >> Please find the answers below.
> >>
> >> 1. Yes, we should perform validation before trying to update the configuration. I'd rather validate some specific options that are known not to work, though. Finding and hard-coding all the valid options might be impractical, since they can change, and non-trivial.
> >>
> >> 2. That would be great, but we'd have to store the history of such updates somewhere. For debugging purposes, logs should suffice, I think.
> >>
> >> 3. That's a great question! Configuration changes are applied immediately; resource-requirement changes are applied with some delay; and both are stored in HA immediately. So a configuration change request also results in restarting and applying any pending resource-requirement changes.
> >>
> >> Regards,
> >> Roman
> >>
> >> On Fri, May 9, 2025, 05:10 Hangxiang Yu <master...@gmail.com> wrote:
> >>
> >> Hi, Roman.
> >>
> >> Thanks for the FLIP. +1 for supporting dynamic configuration to reduce manual restarts.
> >>
> >> I just have the questions below:
> >>
> >> 1. Do we need a list of working configurations? So some unsupported configurations could be rejected in advance.
> >>
> >> 2. Could we show the change history in the Web UI? So more change details could be tracked.
> >>
> >> 3. How does it co-work with other dynamic requests? For example, if it modifies the parallelism together with '/jobs/:jobid/resource-requirements'.
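Roman's answer to question 3 above describes the sequencing: both kinds of updates are persisted to HA immediately, but only configuration changes trigger an immediate restart, which then also picks up any pending resource-requirement change. A toy model of that interplay (class and method names are hypothetical, purely to illustrate the described behaviour):

```python
# Toy model of the sequencing described in answer 3: resource-requirement
# updates are stored but applied with a delay; a configuration update is
# stored and applied immediately, and the restart it triggers also
# applies whatever resource-requirement change is still pending.

class ToyScheduler:
    def __init__(self):
        self.config = {}
        self.resources = {}
        self.pending_resources = None
        self.restarts = 0

    def update_resources(self, req):
        # Stored immediately (HA), applied only after a delay or a restart.
        self.pending_resources = req

    def update_config(self, cfg):
        # Stored and applied immediately; triggers a restart right away.
        self.config.update(cfg)
        if self.pending_resources is not None:
            self.resources = self.pending_resources
            self.pending_resources = None
        self.restarts += 1
```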
> >> On Fri, May 9, 2025 at 5:00 AM Roman Khachatryan <ro...@apache.org> wrote:
> >>
> >> Hi everyone,
> >>
> >> I would like to start a discussion about FLIP-530: Dynamic job configuration [1].
> >>
> >> In some cases, it is desirable to change a Flink job's configuration after it was submitted to Flink, for example:
> >> - Troubleshooting (e.g. increasing the checkpoint timeout or failure threshold)
> >> - Performance optimization (e.g. tuning state backend parameters)
> >> - Enabling new features after testing them in a non-production environment. This allows de-coupling upgrades to newer Flink versions from actually enabling the features.
> >>
> >> To support such use cases, we propose to enhance the Flink job configuration REST endpoint with support for reading the full job configuration and updating it.
> >>
> >> Looking forward to feedback.
> >>
> >> [1] https://cwiki.apache.org/confluence/x/uglKFQ
> >>
> >> Regards,
> >> Roman
> >>
> >> --
> >> Best,
> >> Hangxiang.
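As a usage illustration of the read/update flow the proposal describes, a client interaction might look roughly like this (the exact endpoint path, payload shape, and helper names are assumptions for illustration only; the FLIP is authoritative). No request is sent here; the sketch only builds the HTTP calls:

```python
# Hypothetical client helpers for the proposed job-configuration
# endpoint: GET reads the full job configuration, PUT updates it
# (and a second PUT would revert a bad change, per the discussion).
import json
from urllib.request import Request

BASE = "http://jobmanager:8081"  # assumed JobManager REST address

def get_job_config(job_id: str) -> Request:
    return Request(f"{BASE}/jobs/{job_id}/config", method="GET")

def put_job_config(job_id: str, update: dict) -> Request:
    return Request(
        f"{BASE}/jobs/{job_id}/config",
        data=json.dumps(update).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```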