Ben Augarten created FLINK-25236:
------------------------------------
Summary: Add a mechanism to generate and validate a jobgraph with
a checkpoint before submission
Key: FLINK-25236
URL: https://issues.apache.org/jira/browse/FLINK-25236
Project: Flink
Issue Type: Improvement
Reporter: Ben Augarten
I've mostly worked on Flink 1.9-1.12, but I believe this is still an issue
today.
I've worked on a few Flink applications now that have struggled to reliably
activate new versions of a currently running job. Sometimes users make changes
to a job graph that prevent state from being restored. Sometimes users make
changes to a job graph that make it impossible to schedule on a given cluster
(for example, increased parallelism with insufficient task slots). These
validations are [performed
here|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L120].
These issues are not Flink's fault, but they are only detected when the JM
tries to run the given job graph. For exactly-once applications (and other
applications where running two job graphs for the same application is
undesirable), there is unnecessary downtime when users submit job graphs with
breaking changes: users must cancel the old job, submit the new job to see
whether it is valid and will activate, and then resubmit the old job if
activation fails. As a user with low-latency requirements, I find this change
management process unfortunate, and there doesn't seem to be anything
technical preventing these validations from happening earlier.
Suggestion: provide a mechanism for users to (1) create and (2) validate the
new job graph + checkpoint without running it, so that they do not need to
cancel a currently running version of the job until they're sure the new one
will activate.
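
A minimal sketch of what such a pre-submission check could look like, in plain
Java. Everything here is hypothetical: PreflightCheck, OperatorSpec, and
validate are illustrative names, not Flink APIs, and the checks are simplified
stand-ins for the two failure modes described above (state that cannot be
mapped to the new job graph, and parallelism that exceeds the available task
slots). The slot check assumes all operators share slots, so the job needs as
many slots as its highest operator parallelism.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-submission check; none of these types are Flink APIs. */
public class PreflightCheck {

    /** Simplified description of one operator in the new job graph (hypothetical). */
    public static final class OperatorSpec {
        final int parallelism;
        final int maxParallelism;

        public OperatorSpec(int parallelism, int maxParallelism) {
            this.parallelism = parallelism;
            this.maxParallelism = maxParallelism;
        }
    }

    /**
     * Returns a list of problems; an empty list means both checks passed.
     *
     * @param newOperators operators of the new job graph, keyed by operator ID
     * @param checkpointMaxParallelism max parallelism stored in the checkpoint, keyed by operator ID
     * @param allowNonRestoredState whether state without a matching operator may be dropped
     * @param availableSlots free task slots on the target cluster
     */
    public static List<String> validate(
            Map<String, OperatorSpec> newOperators,
            Map<String, Integer> checkpointMaxParallelism,
            boolean allowNonRestoredState,
            int availableSlots) {
        List<String> problems = new ArrayList<>();

        // Check 1: checkpointed state must map onto an operator in the new graph.
        for (Map.Entry<String, Integer> entry : checkpointMaxParallelism.entrySet()) {
            OperatorSpec op = newOperators.get(entry.getKey());
            if (op == null) {
                if (!allowNonRestoredState) {
                    problems.add("No operator in the new job graph for state of " + entry.getKey());
                }
            } else if (op.maxParallelism != entry.getValue()) {
                problems.add("Max parallelism mismatch for " + entry.getKey());
            }
        }

        // Check 2: schedulability, assuming every operator shares slots (an assumption).
        int requiredSlots = 0;
        for (OperatorSpec op : newOperators.values()) {
            requiredSlots = Math.max(requiredSlots, op.parallelism);
        }
        if (requiredSlots > availableSlots) {
            problems.add("Needs " + requiredSlots + " slots but only "
                    + availableSlots + " are available");
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, OperatorSpec> ops = new HashMap<>();
        ops.put("source", new OperatorSpec(8, 128));
        Map<String, Integer> state = new HashMap<>();
        state.put("source", 128);
        state.put("removed-operator", 128);
        // Prints each detected problem, one per line.
        for (String problem : validate(ops, state, false, 4)) {
            System.out.println(problem);
        }
    }
}
```

With a check like this available before submission, the old job would only be
cancelled once the returned list is empty.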
--
This message was sent by Atlassian Jira
(v8.20.1#820001)