Re: [DISCUSS] FLIP-530: Dynamic job configuration

Andrei Kaigorodov Tue, 13 May 2025 00:45:14 -0700

Hi Roman,

Thank you for the proposal. This is a much-needed feature.


One question:

For the PUT request, does it make sense to use a distinct HTTP status code
in the response when the request fails due to a conflicting update? Since
the new expected version field is included in the body, 409 Conflict could
be used in case of a rejection, or alternatively, 412 Precondition Failed
could be used if the version is moved to a header.

I believe this could make the API easier to use programmatically, as it
would simplify error handling.
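
To illustrate the point about programmatic error handling, here is a minimal
Python sketch of a client-side retry loop. It assumes a version mismatch is
reported as 409 or 412, as suggested above. The names
classify_update_response, update_with_retry, get_config and put_config are
hypothetical and are not part of the FLIP or of any Flink client.

```python
from enum import Enum
from http import HTTPStatus


class UpdateOutcome(Enum):
    APPLIED = "applied"
    VERSION_CONFLICT = "version_conflict"
    OTHER_FAILURE = "other_failure"


def classify_update_response(status_code: int) -> UpdateOutcome:
    """Map an HTTP status code of the (assumed) PUT response to an outcome."""
    if 200 <= status_code < 300:
        return UpdateOutcome.APPLIED
    # Assumption: a stale expected version is reported as 409 or 412.
    if status_code in (HTTPStatus.CONFLICT, HTTPStatus.PRECONDITION_FAILED):
        return UpdateOutcome.VERSION_CONFLICT
    return UpdateOutcome.OTHER_FAILURE


def update_with_retry(get_config, put_config, mutate, max_attempts=3):
    """Optimistic-concurrency update.

    get_config() -> (version, config_dict)          # hypothetical callable
    put_config(expected_version, config) -> status  # hypothetical callable
    mutate(config_dict) -> config_dict              # caller-supplied change
    Returns True if the update was applied, False if every attempt conflicted.
    """
    for _ in range(max_attempts):
        version, config = get_config()
        status = put_config(version, mutate(dict(config)))
        outcome = classify_update_response(status)
        if outcome is UpdateOutcome.APPLIED:
            return True
        if outcome is UpdateOutcome.OTHER_FAILURE:
            raise RuntimeError(f"update failed with HTTP {status}")
        # VERSION_CONFLICT: someone else updated first; re-read and retry.
    return False
```

With a distinct status code, the client can tell "re-read and retry" apart
from "give up" without parsing the error body.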


Best regards,
Kaigorodov Andrei

On Mon, May 12, 2025 at 7:34 AM Junrui Lee <[email protected]> wrote:

> Hi Roman
>
> Thanks for driving this feature. +1 for this proposal.
>
> I also agree with the suggestion made by Feifan.
>
> Currently, not all configuration items are job-level configurations [1].
> Even for those that are, not all job-level config options can be updated at
> runtime through the Adaptive Scheduler. For instance, certain config options
> related to job plan compilation, such as pipeline.operator-chaining.enabled
> and nearly all of the table.* settings, are not eligible for runtime
> updates.
>
> From a user perspective, it would be beneficial to clearly describe which
> config options can be dynamically updated, allowing users to take better
> advantage of this feature.
>
> Best,
> Junrui
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-478+Introduce+Config+Option+Scope
>
> Feifan Wang <[email protected]> wrote on Mon, May 12, 2025 at 11:27:
>
> > Thanks Roman for driving this useful improvement, +1 for this proposal.
> >
> > Thanks also to Hangxiang and Rui Fan for the discussion. Regarding
> > question 1, I have some ideas for discussion:
> >
> > To give users stable expectations, I think we should check
> > configurations against a whitelist, ensuring that every configuration
> > that can be modified through this API actually takes effect.
> >
> > In the initial version, we can provide a very small whitelist, even if
> > it only contains the few configurations that we most want to use and
> > that have been confirmed to be effective. The list can be extended
> > later.
> >
> >
> > ——————————————
> >
> > Best regards,
> > Feifan Wang
> >
> >
> >
> > ---- Replied Message ----
> > | From | Rui Fan<[email protected]> |
> > | Date | 05/11/2025 16:36 |
> > | To | <[email protected]> |
> > | Subject | Re: [DISCUSS] FLIP-530: Dynamic job configuration |
> > Thanks Roman for driving this valuable proposal, it uses the Adaptive
> > Scheduler to greatly reduce the downtime of configuration updates,
> > so +1 for this proposal!
> >
> > Overall LGTM. Thanks to Hangxiang for the questions; I have the same
> > questions as Hangxiang, and I'd like to share my thoughts:
> >
> >
> > For question 1 about validation:
> >
> > I think validation is necessary, but both a list of valid configurations
> > and a list of invalid configurations have limitations.
> >
> > For valid configurations: IIUC, almost all job-level configurations are
> > valid after the job is restarted by the Adaptive Scheduler. That means
> > many configurations would have to be added to the list if we list valid
> > configurations. If a developer forgets to add a new configuration, it
> > will fail validation (even though it works).
> >
> > For invalid configurations: I previously encountered a problem where a
> > user added a non-existent Flink configuration, but Flink could not detect
> > it. It may have been caused by a typo. Therefore, even if we list some
> > Flink configurations that do not support dynamic modification, we still
> > cannot guarantee that configurations outside the list will take effect.
> >
> > Even so, I prefer limited validation, for example: not through a list,
> > but by hard-coding a few rules (e.g. table.* doesn't work).
> >
> >
> > For question 2 about configuration change history:
> >
> > Logging configuration change history in the first version is fine.
> >
> > As I understand it, both configuration changes and resource requirements
> > changes could trigger a rescale in the Adaptive Scheduler, so the rescale
> > history can probably include both. If we want to show the configuration
> > change history, it might be more appropriate to put it in FLIP-487 [1]
> > and FLIP-495 [2].
> >
> > For question 3 about how this works with other dynamic requests:
> >
> > Configuration changes are applied immediately; resource requirements
> > changes are applied with some delay
> >
> > Yes, rescaling after some delay could reduce the rescale frequency and
> > avoid some unnecessary restarts. So I'm curious why configuration changes
> > don't respect the delay mechanism?
> >
> > Please correct me if anything is wrong, thanks!
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
> > [2]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
> >
> > Best,
> > Rui
> >
> >
> > On Sat, May 10, 2025 at 11:57 AM Roman Khachatryan <[email protected]>
> > wrote:
> >
> > Thanks Hangxiang Yu,
> >
> > Please find the answers below.
> >
> > 1. Yes, we should perform validation before trying to update the
> > configuration. I'd rather validate some specific options that are known
> > not to work, though. Finding and hard-coding all the valid options might
> > be impractical and non-trivial, since they can change.
> >
> > 2. That would be great, but we'd have to store the history of such
> > updates somewhere. For debugging purposes, logs should suffice, I think.
> >
> > 3. That's a great question! Configuration changes are applied
> > immediately; resource requirements changes are applied with some delay;
> > and both are stored in HA immediately. So a configuration change request
> > also results in restarting and applying any pending resource requirements
> > changes.
> >
> >
> > Regards,
> > Roman
> >
> > On Fri, May 9, 2025, 05:10 Hangxiang Yu <[email protected]> wrote:
> >
> > Hi, Roman.
> >
> > Thanks for the FLIP.
> > +1 for supporting dynamic configuration to reduce manual restart.
> >
> >
> > I just have the questions below:
> >
> > 1. Do we need a list of working configurations, so that unsupported
> > configurations could be rejected in advance?
> >
> > 2. Could we show the change history in the Web UI, so that more details
> > about changes could be tracked?
> >
> > 3. How does it work together with other dynamic requests? For example,
> > when the parallelism is modified at the same time via
> > '/jobs/:jobid/resource-requirements'.
> >
> > On Fri, May 9, 2025 at 5:00 AM Roman Khachatryan <[email protected]>
> > wrote:
> >
> > Hi everyone,
> >
> > I would like to start a discussion about FLIP-530: Dynamic job
> > configuration [1].
> >
> > In some cases, it is desirable to change a Flink job's configuration
> > after the job was submitted to Flink, for example:
> > - Troubleshooting (e.g. increasing the checkpoint timeout or failure
> > threshold)
> > - Performance optimization (e.g. tuning state backend parameters)
> > - Enabling new features after testing them in a non-production
> > environment. This allows decoupling upgrades to newer Flink versions
> > from actually enabling the features.
> >
> > To support such use cases, we propose to enhance the Flink job
> > configuration REST endpoint with support for reading the full job
> > configuration and updating it.
> >
> > Looking forward to feedback.
> >
> > [1]
> > https://cwiki.apache.org/confluence/x/uglKFQ
> >
> > Regards,
> > Roman
> >
> >
> >
> > --
> > Best,
> > Hangxiang.
> >
> >
> >
>
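
The two validation approaches discussed in the quoted thread (a whitelist of
allowed options versus a few hard-coded rejection rules) can be contrasted in
a short Python sketch. This is only an illustration under assumptions: the
contents of ALLOWED_KEYS are placeholders and have not been confirmed as
dynamically updatable, and the function names are hypothetical. The rejected
entries are taken from the examples mentioned in the thread (table.* and
pipeline.operator-chaining.enabled).

```python
# Rule-based approach: reject only what is known not to work.
DENIED_PREFIXES = ("table.",)
DENIED_KEYS = frozenset({"pipeline.operator-chaining.enabled"})

# Whitelist approach: accept only what is confirmed to work.
# Placeholder content, chosen for illustration only.
ALLOWED_KEYS = frozenset({"execution.checkpointing.timeout"})


def rejected_by_rules(changes):
    """Return the keys rejected by the hard-coded deny rules."""
    return sorted(
        key
        for key in changes
        if key in DENIED_KEYS or key.startswith(DENIED_PREFIXES)
    )


def rejected_by_allowlist(changes):
    """Return the keys that are not on the whitelist."""
    return sorted(key for key in changes if key not in ALLOWED_KEYS)


changes = {
    "table.exec.state.ttl": "1h",
    "execution.checkpointing.timeout": "20min",
    "execution.checkpointing.timeuot": "20min",  # typo in the key
}
print(rejected_by_rules(changes))
print(rejected_by_allowlist(changes))
```

The misspelled key passes the deny rules but is caught by the whitelist,
which is the trade-off raised in the thread: rules need less maintenance,
while a whitelist also catches non-existent options.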
