[ https://issues.apache.org/jira/browse/FLINK-21030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272108#comment-17272108 ]
Matthias commented on FLINK-21030:
----------------------------------

Just to clarify: the expected behavior is then that the command still fails, but the job resumes from the most recent checkpoint afterwards. I would proceed with [~zhuzh]'s proposal, which should be straightforward.

I was also looking into where we could test this behavior. First, I identified {{JobMasterStopWithSavepointIT}}, as it seems to collect all the stop-with-savepoint related use cases. Unfortunately, this test class was unmaintained for a while due to wrong naming and is failing right now (this is covered by FLINK-21031). Alternatively, I'd propose adding the test to {{SavepointITCase}} so that FLINK-21030 does not depend on FLINK-21031. Do you have any objections to that?

> Broken job restart for job with disjoint graph
> ----------------------------------------------
>
>                 Key: FLINK-21030
>                 URL: https://issues.apache.org/jira/browse/FLINK-21030
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.2
>            Reporter: Theo Diefenthal
>            Assignee: Matthias
>            Priority: Blocker
>             Fix For: 1.13.0, 1.11.4, 1.12.2
>
>
> Building on top of bugs:
> https://issues.apache.org/jira/browse/FLINK-21028
> and https://issues.apache.org/jira/browse/FLINK-21029 :
> I tried to stop a Flink application on YARN via savepoint, which didn't
> succeed due to a possible bug/race condition in shutdown (Bug 21028). For
> some reason, Flink attempted to restart the pipeline after the failure in
> shutdown (21029). The bug here:
> As I mentioned: my job graph is disjoint and the pipelines are fully isolated.
> Let's say the original error occurred in a single task of pipeline1. Flink then
> restarted the entire pipeline1, but pipeline2 was shut down successfully and
> switched its state to FINISHED.
> My job was thus in a kind of invalid state after the attempt to stop it:
> one of the two pipelines was running, the other was FINISHED. I guess this is
> kind of a bug in the restarting behavior: only the connected component of the
> graph that contains the failure is restarted, but the others aren't...

--
This message was sent by Atlassian Jira
(v8.3.4#803005)