Hi Roman,

I've been following the FLIP-530 proposal for dynamic job configuration with great interest - thank you for spearheading this valuable initiative. The ability to adjust configuration on the fly will undoubtedly be a significant enhancement for Flink.
As I was thinking through its practical application, particularly in production scenarios, one question came to mind regarding the operational recovery path. If a dynamic configuration update is successfully applied via the API, but the new settings inadvertently push the job into a persistent failure loop (for example, a parameter that passes initial validation but causes issues at runtime), what is the envisioned process for managing and recovering from that situation? Would the typical approach be another PUT request that reverts to a known-good configuration or applies a corrective one? (I've sketched roughly what I mean at the end of this message.) Understanding the intended recovery mechanics here would be very helpful.

Thank you again for your excellent work on this FLIP.

Best regards,
Kartikey Pant

On Fri, May 9, 2025 at 2:30 AM Roman Khachatryan <ro...@apache.org> wrote:

> Hi everyone,
>
> I would like to start a discussion about FLIP-530: Dynamic job
> configuration [1].
>
> In some cases, it is desirable to change Flink job configuration after it
> was submitted to Flink, for example:
> - Troubleshooting (e.g. increase checkpoint timeout or failure threshold)
> - Performance optimization, (e.g. tuning state backend parameters)
> - Enabling new features after testing them in a non-Production environment.
> This allows to de-couple upgrading to newer Flink versions from actually
> enabling the features.
> To support such use-cases, we propose to enhance Flink job configuration
> REST-endpoint with the support to read full job configuration; and update
> it.
>
> Looking forward to feedback.
>
> [1]
> https://cwiki.apache.org/confluence/x/uglKFQ
>
> Regards,
> Roman
>
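
P.S. To make the question concrete, below is a rough sketch of the revert flow I had in mind. The endpoint path (/jobs/<jobId>/config), the PUT semantics, and the payload shape are my assumptions for illustration only, not something the FLIP has finalized:

    // Hypothetical revert flow: read the full current job configuration,
    // then PUT back a previously captured known-good configuration.
    // Endpoint path, HTTP verb, and payload shape are assumptions for
    // illustration, not the FLIP's final API.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ConfigRevertSketch {
        public static void main(String[] args) throws Exception {
            String base = "http://jobmanager:8081"; // JobManager REST address (assumed)
            String jobId = args[0];                 // job to operate on
            HttpClient client = HttpClient.newHttpClient();

            // 1) Read the full current configuration (the proposed read support).
            HttpRequest get = HttpRequest
                    .newBuilder(URI.create(base + "/jobs/" + jobId + "/config"))
                    .GET()
                    .build();
            String current = client.send(get, HttpResponse.BodyHandlers.ofString()).body();
            System.out.println("Current config: " + current);

            // 2) If the job starts fail-looping after an update, PUT back the
            //    known-good configuration captured before the change.
            String knownGood = "{\"execution.checkpointing.timeout\": \"10 min\"}"; // example payload (assumed shape)
            HttpRequest revert = HttpRequest
                    .newBuilder(URI.create(base + "/jobs/" + jobId + "/config"))
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(knownGood))
                    .build();
            int status = client.send(revert, HttpResponse.BodyHandlers.ofString()).statusCode();
            System.out.println("Revert response: " + status);
        }
    }

My question is essentially whether this "revert via another update" path is the intended recovery story, or whether something else (e.g. automatic rollback on repeated failures) is envisioned.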