Thanks, Roman, for driving this useful improvement; +1 for this proposal. Thanks also to Hangxiang and Rui Fan for the discussion. Regarding question 1, I have some ideas to discuss:
To give users stable expectations, I think we should validate configuration changes against a whitelist, so that every configuration that can be modified through this API actually takes effect. In the initial version, the whitelist can be very small, even if it contains only the few configurations that we most want to use and have confirmed to be effective. The list can be extended over time.

Best regards,
Feifan Wang

---- Replied Message ----
From: Rui Fan <1996fan...@gmail.com>
Date: 05/11/2025 16:36
To: <dev@flink.apache.org>
Subject: Re: [DISCUSS] FLIP-530: Dynamic job configuration

Thanks Roman for driving this valuable proposal. It uses the Adaptive Scheduler to greatly reduce the downtime of configuration updates, so +1 for this proposal!

Overall LGTM. Thanks to Hangxiang for the questions; I have the same questions as Hangxiang. I'd like to share my thoughts:

For question 1 about validation:

I think validation is necessary, but both a list of valid configurations and a list of invalid configurations have limitations.

For valid configurations: IIUC, almost all job-level configurations are valid after the job is restarted by the Adaptive Scheduler. That means lots of configurations would need to be added to the list if we list valid configurations. If other developers forget to add a new configuration, it will fail validation even though it works.

For invalid configurations: I encountered a problem before where a user added a non-existent Flink configuration, possibly because of a typo, and Flink could not detect it. Therefore, even if we list some Flink configurations that do not support dynamic modification, we still cannot guarantee that configurations outside the list will take effect.

Even so, I prefer limited validation, for example not through a list but by hard-coding a few rules (e.g. table.* doesn't work).
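As an illustration only (not part of the FLIP or of Flink), the two validation styles discussed above, a whitelist of confirmed options and a few hard-coded rejection rules, could be sketched in Java roughly as follows. The class name, the method names, and the choice of whitelisted keys are assumptions made for this example; only the table.* rule comes from the thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: these names are illustrative, not Flink or FLIP-530 APIs.
final class DynamicConfigValidator {

    // Whitelist approach: only options confirmed to take effect may be changed.
    // Both keys are existing Flink options, but listing them here is an assumption.
    static final Set<String> ALLOWED_KEYS = Set.of(
            "execution.checkpointing.timeout",
            "execution.checkpointing.tolerable-failed-checkpoints");

    // Rule-based approach: reject only prefixes known not to work (table.* per the thread).
    static final List<String> REJECTED_PREFIXES = List.of("table.");

    // Returns one error message per key that is not on the whitelist.
    static List<String> validateWithWhitelist(Map<String, String> changes) {
        List<String> errors = new ArrayList<>();
        for (String key : changes.keySet()) {
            if (!ALLOWED_KEYS.contains(key)) {
                errors.add("Not in whitelist: " + key);
            }
        }
        return errors;
    }

    // Returns one error message per key that starts with a rejected prefix.
    static List<String> validateWithRules(Map<String, String> changes) {
        List<String> errors = new ArrayList<>();
        for (String key : changes.keySet()) {
            for (String prefix : REJECTED_PREFIXES) {
                if (key.startsWith(prefix)) {
                    errors.add("Known not to take effect dynamically: " + key);
                    break;
                }
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        Map<String, String> changes = new LinkedHashMap<>();
        changes.put("execution.checkpointing.timeout", "20 min");
        changes.put("table.exec.state.ttl", "1 h");
        changes.put("execution.checkpointing.timeuot", "20 min"); // deliberate typo
        System.out.println(validateWithWhitelist(changes));
        System.out.println(validateWithRules(changes));
    }
}
```

With these inputs, the whitelist rejects both the table.* key and the misspelled key, while the rule-based check rejects only the table.* key and lets the typo through. That is the trade-off raised in the discussion: a whitelist gives stronger guarantees but must be maintained, whereas a few rules are cheap but cannot catch unknown or misspelled options.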
For question 2 about configuration change history:

Logging the configuration change history in the first version is fine. As I understand it, both configuration changes and resource requirements changes can trigger a rescale in the Adaptive Scheduler, so the rescale history can probably include both. If we want to show the configuration change history, it might be more appropriate to do so in FLIP-487 [1] and FLIP-495 [2].

For question 3 about how it works together with other dynamic requests:

"Configuration changes are applied immediately; resource requirements changes are applied with some delay"

Yes, rescaling after some delay can reduce the rescale frequency and avoid some invalid restarts. So I'm curious why configuration changes don't respect the delay mechanism.

Please correct me if anything is wrong, thanks!

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history

Best,
Rui

On Sat, May 10, 2025 at 11:57 AM Roman Khachatryan <ro...@apache.org> wrote:

Thanks Hangxiang Yu,

Please find the answers below.

1. Yes, we should perform validation before trying to update the configuration. I'd rather validate some specific options that are known not to work, though. Finding and hard-coding all the valid options might be impractical, since they can change, and non-trivial.
2. That would be great, but we'd have to store the history of such updates somewhere. For debugging purposes, logs should suffice, I think.
3. That's a great question! Configuration changes are applied immediately; resource requirements changes are applied with some delay; and both are stored in HA immediately. So a configuration change request also results in restarting and applying any pending resource requirements changes.

Regards,
Roman

On Fri, May 9, 2025, 05:10 Hangxiang Yu <master...@gmail.com> wrote:

Hi, Roman.
Thanks for the FLIP.
+1 for supporting dynamic configuration to reduce manual restarts. I just have the questions below:

1. Do we need a list of working configurations, so that unsupported configurations can be rejected in advance?
2. Could we show the change history in the Web UI, so that more details of the changes can be tracked?
3. How does it work together with other dynamic requests? For example, when it modifies the parallelism together with '/jobs/:jobid/resource-requirements'.

On Fri, May 9, 2025 at 5:00 AM Roman Khachatryan <ro...@apache.org> wrote:

Hi everyone,

I would like to start a discussion about FLIP-530: Dynamic job configuration [1].

In some cases, it is desirable to change a Flink job's configuration after it has been submitted to Flink, for example:
- Troubleshooting (e.g. increasing the checkpoint timeout or failure threshold)
- Performance optimization (e.g. tuning state backend parameters)
- Enabling new features after testing them in a non-production environment. This allows decoupling the upgrade to a newer Flink version from actually enabling the features.

To support such use cases, we propose to enhance the Flink job configuration REST endpoint with support for reading the full job configuration and updating it.

Looking forward to feedback.

[1] https://cwiki.apache.org/confluence/x/uglKFQ

Regards,
Roman

--
Best,
Hangxiang.