[jira] [Updated] (FLINK-25236) Add a mechanism to generate and validate a jobgraph with a checkpoint before submission

Ben Augarten (Jira) Thu, 09 Dec 2021 10:44:04 -0800


     [ https://issues.apache.org/jira/browse/FLINK-25236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Ben Augarten updated FLINK-25236:
---------------------------------
    Description: 
I've mostly worked on Flink 1.9-1.12, but I believe this is still an issue 
today.

 

I've worked on a few Flink applications now that have struggled to reliably 
activate new versions of a currently running job. Sometimes users make changes 
to a job graph that prevent state from being restored. Sometimes users make 
changes to a job graph that make it impossible to schedule on a given cluster 
(for example, increased parallelism with insufficient task slots on the 
cluster). These validations are [performed 
here|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L120].

 

It's not Flink's problem that these issues arise, but they are only detected 
when the JobManager (JM) tries to run the given job graph. For exactly-once 
applications (and other applications where running two job graphs for the same 
application is undesirable), there is unneeded downtime when users submit 
job graphs with breaking changes, because users must cancel the old job, submit 
the new job to see if it is valid and will activate, and then resubmit the old 
job when activation fails. As a user with low-latency requirements, I find this 
change management process unfortunate, and there doesn't seem to be anything 
technical preventing these validations from happening earlier.

 

Suggestion: provide a mechanism for users to (1) create and (2) validate the 
new job graph + checkpoint without running it, so that they do not need to 
cancel a currently running version of the job until they're more sure it will 
activate.
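
To make the suggestion concrete, here is a minimal sketch of the kind of check 
that could run before the old job is cancelled. Everything in it is 
hypothetical: PreflightCheck, its Operator type, and the inputs are 
illustrative stand-ins, not Flink APIs. It assumes default slot sharing (so 
the job needs as many slots as its highest operator parallelism) and, as a 
simplification, treats any change in max parallelism as a failure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pre-submission check. None of these types are Flink APIs.
public class PreflightCheck {

    // Simplified stand-in for one operator of the new job graph.
    public static final class Operator {
        final String id;
        final int parallelism;
        final int maxParallelism;

        public Operator(String id, int parallelism, int maxParallelism) {
            this.id = id;
            this.parallelism = parallelism;
            this.maxParallelism = maxParallelism;
        }
    }

    // checkpointState maps the ID of each operator that has state in the
    // checkpoint to the max parallelism recorded for it.
    // Returns a list of problems; an empty list means no issue was found.
    public static List<String> validate(
            Map<String, Integer> checkpointState,
            List<Operator> newGraph,
            int availableSlots,
            boolean allowNonRestoredState) {
        List<String> problems = new ArrayList<>();
        Map<String, Operator> byId = new HashMap<>();
        int slotsNeeded = 0;
        for (Operator op : newGraph) {
            byId.put(op.id, op);
            // Assumption: default slot sharing, so demand is the highest parallelism.
            slotsNeeded = Math.max(slotsNeeded, op.parallelism);
        }
        for (Map.Entry<String, Integer> e : checkpointState.entrySet()) {
            Operator op = byId.get(e.getKey());
            if (op == null) {
                // State with no matching operator is only acceptable if the
                // user explicitly allows it to be dropped.
                if (!allowNonRestoredState) {
                    problems.add("state for operator " + e.getKey()
                            + " has no match in the new graph");
                }
            } else if (op.maxParallelism != e.getValue()) {
                problems.add("max parallelism changed for operator " + e.getKey());
            }
        }
        if (slotsNeeded > availableSlots) {
            problems.add("needs " + slotsNeeded + " slots, cluster has " + availableSlots);
        }
        return problems;
    }
}
```

A deployment tool could call validate(...) against the latest checkpoint's 
metadata and the cluster's free slots, and only cancel the running job when 
the returned list is empty.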

  was:
I've mostly worked on Flink 1.9-1.12, but I believe this is still an issue 
today.

 

I've worked on a few Flink applications now that have struggled to reliably 
activate new versions of a currently running job. Sometimes users make changes 
to a job graph that prevent state from being restored. Sometimes users make 
changes to a job graph that make it impossible to schedule on a given cluster 
(for example, increased parallelism with insufficient task slots on the 
cluster). These validations are [performed 
here|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L120].

 

It's not Flink's problem that these issues arise, but they are only detected 
when the JobManager (JM) tries to run the given job graph. For exactly-once 
applications (and other applications where running two job graphs for the same 
application is undesirable), there is unneeded downtime when users submit 
job graphs with breaking changes, because users must cancel the old job, submit 
the new job to see if it is valid and will activate, and then resubmit the old 
job when activation fails. As a user with low-latency requirements, I find this 
change management process unfortunate, and there doesn't seem to be anything 
technical preventing these validations from happening earlier.

 

Suggestion: provide a mechanism for users to (1) create and (2) validate the 
new job graph + checkpoint without running it, so that they do not need to 
cancel a currently running version of the job until they're sure it will 
activate.


> Add a mechanism to generate and validate a jobgraph with a checkpoint before 
> submission
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-25236
>                 URL: https://issues.apache.org/jira/browse/FLINK-25236
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Ben Augarten
>            Priority: Major
>
> I've mostly worked on Flink 1.9-1.12, but I believe this is still an issue 
> today.
>  
> I've worked on a few Flink applications now that have struggled to reliably 
> activate new versions of a currently running job. Sometimes users make 
> changes to a job graph that prevent state from being restored. Sometimes 
> users make changes to a job graph that make it impossible to schedule on a 
> given cluster (for example, increased parallelism with insufficient task 
> slots on the cluster). These validations are [performed 
> here|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L120].
>  
> It's not Flink's problem that these issues arise, but they are only detected 
> when the JobManager (JM) tries to run the given job graph. For exactly-once 
> applications (and other applications where running two job graphs for the 
> same application is undesirable), there is unneeded downtime when users 
> submit job graphs with breaking changes, because users must cancel the old 
> job, submit the new job to see if it is valid and will activate, and then 
> resubmit the old job when activation fails. As a user with low-latency 
> requirements, I find this change management process unfortunate, and there 
> doesn't seem to be anything technical preventing these validations from 
> happening earlier.
>  
> Suggestion: provide a mechanism for users to (1) create and (2) validate the 
> new job graph + checkpoint without running it, so that they do not need to 
> cancel a currently running version of the job until they're more sure it will 
> activate.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
