Thanks Roman for driving this useful improvement, +1 for this proposal.
Thanks also to Hangxiang and Rui Fan for the discussion. Regarding question 1, I
have some ideas for discussion:
To give users stable expectations, I think we should perform configuration
checks in a whitelist manner, ensuring that the configurations allowed to be
modified through this API can actually take effect.
In the initial version, we can provide a very small whitelist, even if it
only contains a few configurations that we most want to use and that have
been confirmed to be effective. This list can be extended later.
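To illustrate the whitelist idea, here is a minimal sketch. It is not part of Flink or of FLIP-530: the class name `AllowlistValidator`, the `validate` method, and the contents of `ALLOWED` are all assumptions made for illustration, and the two option keys have not been confirmed to take effect dynamically.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical whitelist check; not part of Flink or of FLIP-530. */
final class AllowlistValidator {

    // Assumed initial whitelist. The keys are illustrative placeholders and
    // have NOT been confirmed to take effect after a dynamic update.
    static final Set<String> ALLOWED = Set.of(
            "execution.checkpointing.timeout",
            "execution.checkpointing.interval");

    /** Returns the rejected entries (key -> reason); an empty map means the update is accepted. */
    static Map<String, String> validate(Map<String, String> requested) {
        Map<String, String> rejected = new LinkedHashMap<>();
        for (String key : requested.keySet()) {
            if (!ALLOWED.contains(key)) {
                rejected.put(key, "not in the dynamic-update whitelist");
            }
        }
        return rejected;
    }
}
```

With this shape, a mistyped key is rejected the same way as an unsupported one, which is the stable expectation argued for above.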
——————————————
Best regards,
Feifan Wang
---- Replied Message ----
| From | Rui Fan<1996fan...@gmail.com> |
| Date | 05/11/2025 16:36 |
| To | <dev@flink.apache.org> |
| Subject | Re: [DISCUSS] FLIP-530: Dynamic job configuration |
Thanks Roman for driving this valuable proposal, it uses the Adaptive
Scheduler to greatly reduce the downtime of configuration updates,
so +1 for this proposal!
Overall LGTM. Thanks to Hangxiang for the questions; I have the
same questions as Hangxiang. I'd like to share my thoughts:
For question 1 about validation:
I think validation is necessary, but both the list of valid configurations
and the list of invalid configurations have limitations.
For valid configurations: IIUC, almost all job-level configurations are valid
after the job is restarted by the Adaptive Scheduler. This means that many new
configurations would need to be added to the list if we list valid
configurations. If other developers forget to add one, the new configuration
will fail validation (even though it works).
For invalid configurations: I encountered a problem before where a user
added a non-existent Flink configuration, but Flink could not detect it.
It may have been caused by a typo. Therefore, even if we list some Flink
configurations that do not support dynamic modification, we still cannot
guarantee that the configurations outside the list will take effect.
Even so, I prefer to do limited validation, for example: not through a list,
but by hard-coding a few rules (e.g. table.* doesn't work).
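A rough sketch of such rule-based validation follows. It is only an illustration under stated assumptions: the class `RuleBasedValidator` and its `validate` method are hypothetical, and the only rule included is the `table.` prefix mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical rule-based check; the rule set is an example, not a vetted list. */
final class RuleBasedValidator {

    // Assumed rule set: only the "table." prefix comes from this thread.
    static final List<String> UNSUPPORTED_PREFIXES = List.of("table.");

    /** Returns one error message per key that matches an "unsupported" rule. */
    static List<String> validate(Map<String, String> requested) {
        List<String> errors = new ArrayList<>();
        for (String key : requested.keySet()) {
            for (String prefix : UNSUPPORTED_PREFIXES) {
                if (key.startsWith(prefix)) {
                    errors.add(key + ": options under '" + prefix
                            + "*' cannot be updated dynamically");
                }
            }
        }
        return errors;
    }
}
```

Note that a mistyped or non-existent key passes this check untouched, which is exactly the limitation described above for lists of invalid configurations.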
For question 2 about configuration change history:
Logging configuration change history in the first version is fine.
As I understand it, both configuration changes and resource requirements
changes could trigger a rescale in the Adaptive Scheduler, so the rescale
history can probably include both. If we want to show the configuration change
history, it might be more appropriate to put it in FLIP-487[1] and FLIP-495[2].
For question 3 about how this works together with other dynamic requests:
"Configuration changes are applied immediately; resource requirements
changes are applied with some delay"
Yes, rescaling after some delay could reduce the rescale frequency and avoid
some unnecessary restarts. So I'm curious why configuration changes don't
respect the delay mechanism?
Please correct me if anything is wrong, thanks!
[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
Best,
Rui
On Sat, May 10, 2025 at 11:57 AM Roman Khachatryan <ro...@apache.org>
wrote:
Thanks Hangxiang Yu,
Please find the answers below
1. Yes, we should perform validation before trying to update the
configuration. I'd rather validate some specific options that are known not
to work, though. Finding and hard-coding all the valid options might be
impractical, since they can change and the task is non-trivial.
2. That would be great, but we'd have to store the history of such updates
somewhere. For debugging purposes, logs should suffice, I think.
3. That's a great question! Configuration changes are applied immediately;
resource requirements changes are applied with some delay; and both are
stored in HA immediately. So a configuration change request also results in
restarting and applying any pending resource requirements changes.
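The interaction in point 3 can be pictured with a toy model. This is not AdaptiveScheduler code: `PendingChangesModel` and all of its methods are hypothetical names, parallelism stands in for resource requirements, and HA storage is left out entirely.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the behaviour described in point 3; not Flink code. */
final class PendingChangesModel {
    private final Map<String, String> activeConfig = new HashMap<>();
    private int activeParallelism = 1;
    private Integer pendingParallelism; // null means nothing is pending
    private int restarts;

    /** Resource requirements changes are recorded and wait for the delay to pass. */
    void requestParallelism(int parallelism) {
        pendingParallelism = parallelism;
    }

    /** Called when the (assumed) rescale delay has elapsed. */
    void onDelayElapsed() {
        if (pendingParallelism != null) {
            restart();
        }
    }

    /** Configuration changes restart immediately and carry pending requirements along. */
    void updateConfig(Map<String, String> changes) {
        activeConfig.putAll(changes);
        restart();
    }

    private void restart() {
        if (pendingParallelism != null) {
            activeParallelism = pendingParallelism;
            pendingParallelism = null;
        }
        restarts++;
    }

    Map<String, String> activeConfig() { return activeConfig; }
    int activeParallelism() { return activeParallelism; }
    int restarts() { return restarts; }
}
```

In this model, a parallelism request followed by a configuration update leads to a single restart that applies both; the later delay expiry finds nothing pending and does not restart again.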
Regards,
Roman
On Fri, May 9, 2025, 05:10 Hangxiang Yu <master...@gmail.com> wrote:
Hi, Roman.
Thanks for the FLIP.
+1 for supporting dynamic configuration to reduce manual restart.
I just have the questions below:
1. Do we need a list of working configurations, so that unsupported
configurations can be rejected in advance?
2. Could we show the change history in the Web UI, so that more details of
the changes can be tracked?
3. How does it work together with other dynamic requests? For example, when
parallelism is modified at the same time through
'/jobs/:jobid/resource-requirements'.
On Fri, May 9, 2025 at 5:00 AM Roman Khachatryan <ro...@apache.org>
wrote:
Hi everyone,
I would like to start a discussion about FLIP-530: Dynamic job
configuration [1].
In some cases, it is desirable to change Flink job configuration after it
was submitted to Flink, for example:
- Troubleshooting (e.g. increasing the checkpoint timeout or failure
threshold)
- Performance optimization (e.g. tuning state backend parameters)
- Enabling new features after testing them in a non-production
environment. This allows decoupling the upgrade to a newer Flink version
from actually enabling the features.
To support such use cases, we propose enhancing the Flink job configuration
REST endpoint with support for reading the full job configuration and
updating it.
Looking forward to feedback.
[1]
https://cwiki.apache.org/confluence/x/uglKFQ
Regards,
Roman
--
Best,
Hangxiang.