Hi Roman,

I've been following the FLIP-530 proposal for dynamic job configuration
with great interest - thank you for spearheading this valuable initiative.
The ability to adjust configurations on the fly will undoubtedly be a
significant enhancement for Flink.

As I was thinking through its practical application, particularly in
production scenarios, one question came to mind about the operational
recovery path. Suppose a dynamic configuration update is applied
successfully via the API, but the new settings inadvertently push the job
into a persistent failure loop (perhaps because a parameter passes initial
validation but causes issues at runtime). What would the envisioned process
be for managing and recovering from such a situation? Would the typical
approach be another PUT request to revert to a known-good configuration, or
to apply a corrective one?
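
To make the question concrete, here is a rough sketch (using the plain
Java 11 HttpClient) of the kind of revert I have in mind. The
PUT /jobs/<jobId>/config endpoint shape and the flat key/value JSON payload
are only my assumptions about what the FLIP-530 API might look like, not
what the FLIP specifies:

// Hypothetical sketch only: the endpoint, HTTP method and payload shape
// below are my assumptions about the eventual FLIP-530 API.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RevertJobConfig {

    public static void main(String[] args) throws Exception {
        String restAddress = "http://localhost:8081";        // JobManager REST address
        String jobId = "0000000000000000000000000000000a";   // placeholder job id

        // Known-good settings to re-apply after a bad dynamic update.
        String knownGoodConfig =
                "{ \"execution.checkpointing.timeout\": \"10 min\","
              + "  \"execution.checkpointing.tolerable-failed-checkpoints\": \"3\" }";

        // Assumed: PUT /jobs/<jobId>/config accepting a flat key/value JSON map.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(restAddress + "/jobs/" + jobId + "/config"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(knownGoodConfig))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode() + " " + response.body());
    }
}

I mainly want to confirm whether such a revert is expected to be the
standard recovery path, or whether something else (e.g. restoring the
submission-time configuration) is envisioned.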

Understanding the intended recovery mechanics here would be very helpful.

Thank you again for your excellent work on this FLIP.

Best regards,

Kartikey Pant.

On Fri, May 9, 2025 at 2:30 AM Roman Khachatryan <ro...@apache.org> wrote:

> Hi everyone,
>
> I would like to start a discussion about FLIP-530: Dynamic job
> configuration [1].
>
> In some cases, it is desirable to change the Flink job configuration after
> it has been submitted to Flink, for example:
> - Troubleshooting (e.g. increasing the checkpoint timeout or failure threshold)
> - Performance optimization (e.g. tuning state backend parameters)
> - Enabling new features after testing them in a non-production environment.
> This decouples upgrading to newer Flink versions from actually enabling
> the features.
> To support such use cases, we propose to enhance the Flink job configuration
> REST endpoint to support reading the full job configuration and updating it.
>
> Looking forward to feedback.
>
> [1]
> https://cwiki.apache.org/confluence/x/uglKFQ
>
> Regards,
> Roman
>
