Thanks all for the feedback!

@Rui Fan
These are very good points, and I agree. In the PoC, we've actually implemented validation of some specific options that can't be changed (in addition to the white list). I will look at the FLIPs you referred to.
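For illustration only, here is a minimal sketch of how a white list could be combined with a few hard-coded rejection rules. It is not the PoC code: the class name, the method names, the wildcard semantics ("*" and "prefix.*") and the choice of "table." as a rejected prefix are all assumptions made for this example.

```java
import java.util.List;

/** Hypothetical helper; not part of Flink or of the PoC. */
final class DynamicOptionValidator {
    // Assumed example of options that cannot be changed on a running job.
    private static final List<String> REJECTED_PREFIXES = List.of("table.");

    private final List<String> allowList;

    DynamicOptionValidator(List<String> allowList) {
        this.allowList = List.copyOf(allowList);
    }

    /** True if the key hits no rejection rule and matches an allow-list entry. */
    boolean isChangeable(String key) {
        // Hard-coded rules win over the allow list.
        for (String prefix : REJECTED_PREFIXES) {
            if (key.startsWith(prefix)) {
                return false;
            }
        }
        for (String pattern : allowList) {
            if (matches(pattern, key)) {
                return true;
            }
        }
        return false;
    }

    // Assumed semantics: "*" matches any key, "a.b.*" matches keys starting
    // with "a.b.", anything else must match the key exactly.
    private static boolean matches(String pattern, String key) {
        if (pattern.equals("*")) {
            return true;
        }
        if (pattern.endsWith(".*")) {
            return key.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(key);
    }
}
```

With an allow list of "*", a key such as execution.checkpointing.timeout would pass, while table.exec.state.ttl would still be rejected by the hard-coded rule.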
@Feifan Wang, Junrui Lee
W.r.t. the white-list, I've added
"jobmanager.execution.dynamic-configuration.allow-list: *"
From my experience, options related to checkpointing and failure handling are useful.
Do you have any particular options in mind to white-list initially?

@Andrei Kaigorodov
You're right, 409 would be more appropriate. I've updated the document.

Regards,
Roman

On Sun, May 11, 2025 at 10:36 AM Rui Fan <1996fan...@gmail.com> wrote:

> Thanks Roman for driving this valuable proposal, it uses the Adaptive
> Scheduler to greatly reduce the downtime of configuration updates,
> so +1 for this proposal!
>
> Overall LGTM, thanks to Hangxiang for the questions, and I have the
> same questions as Hangxiang. I'd like to share my thoughts:
>
> For question 1 about validation:
>
> I think validation is necessary, but both the list of valid configurations
> and the list of invalid configurations have limitations.
>
> For valid configurations: IIUC, almost all job level configurations are
> valid after restarting the job by the adaptive scheduler. It means lots of
> new configurations need to be added to the list if we list valid
> configurations. If other developers miss it, the new configuration will
> fail validation (but it works).
>
> For invalid configurations: I encountered a problem before, where the user
> added a non-existent Flink configuration, but Flink could not detect it.
> It may be caused by a typo. Therefore, even if we list some Flink
> configurations that do not support dynamic modification, we still cannot
> guarantee that the configurations outside the list will take effect.
>
> Even so, I prefer to do limited validation, for example: not through a
> list, but hard code a few rules (e.g. table.* doesn't work).
>
> For question 2 about configuration change history:
>
> Logging configuration change history in the first version is fine.
>
> As I understand, both of configuration change and resource requirements
> change could trigger a rescale for Adaptive Scheduler. So rescale history
> can probably include both. If we want to show the configuration change
> history, it might be more appropriate to put it in FLIP-487[1] and
> FLIP-495[2].
>
> For question 3 about co-works with other dynamic requests:
>
> > Configuration changes are applied immediately; resource requirements
> > changes are applied with some delay
>
> Yes, rescale after some delay could reduce the rescale frequency to avoid
> some invalid restarts. So I'm curious why configuration changes don't
> respect the delay mechanism?
>
> Please correct me if anything is wrong, thanks!
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
> [2]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
>
> Best,
> Rui
>
> On Sat, May 10, 2025 at 11:57 AM Roman Khachatryan <ro...@apache.org>
> wrote:
>
> > Thanks Hangxiang Yu,
> >
> > Please find the answers below
> >
> > 1. Yes, we should perform validation before trying to update the
> > configuration. I'd rather validate some specific options that are known
> > to not work though. Finding and hard-coding all the valid options might
> > be impractical since they can change, and non trivial.
> >
> > 2. That would be great, but we'd have to store the history of such
> > updates somewhere. For debugging purposes, logs should suffice I think
> >
> > 3. That's a great question! Configuration changes are applied
> > immediately; resource requirements changes are applied with some delay;
> > and both are stored in HA immediately.
> > So a configuration change request also results in
> > restarting and applying any pending resource requirements changes
> >
> > Regards,
> > Roman
> >
> > On Fri, May 9, 2025, 05:10 Hangxiang Yu <master...@gmail.com> wrote:
> >
> > > Hi, Roman.
> > >
> > > Thanks for the FLIP.
> > > +1 for supporting dynamic configuration to reduce manual restart.
> > >
> > > I just have below questions:
> > >
> > > 1. Do we need a working configuration list? So some unsupported
> > > configurations could be rejected in advance.
> > >
> > > 2. Could we show the change history in the Web UI? So more changed
> > > details could be tracked.
> > >
> > > 3. How does it co-work with other dynamic requests? For example, it
> > > modifies the parallelisms together with
> > > '/jobs/:jobid/resource-requirements'.
> > >
> > > On Fri, May 9, 2025 at 5:00 AM Roman Khachatryan <ro...@apache.org>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I would like to start a discussion about FLIP-530: Dynamic job
> > > > configuration [1].
> > > >
> > > > In some cases, it is desirable to change Flink job configuration
> > > > after it was submitted to Flink, for example:
> > > > - Troubleshooting (e.g. increase checkpoint timeout or failure
> > > > threshold)
> > > > - Performance optimization (e.g. tuning state backend parameters)
> > > > - Enabling new features after testing them in a non-Production
> > > > environment. This allows to de-couple upgrading to newer Flink
> > > > versions from actually enabling the features.
> > > >
> > > > To support such use-cases, we propose to enhance Flink job
> > > > configuration REST-endpoint with the support to read full job
> > > > configuration; and update it.
> > > >
> > > > Looking forward to feedback.
> > > >
> > > > [1]
> > > > https://cwiki.apache.org/confluence/x/uglKFQ
> > > >
> > > > Regards,
> > > > Roman
> > >
> > > --
> > > Best,
> > > Hangxiang.