[ https://issues.apache.org/jira/browse/FLINK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186566#comment-17186566 ]
Till Rohrmann commented on FLINK-18828:
---------------------------------------

I believe the question here is whether a job reaching the {{FAILED}} state should be considered a failure of Flink or not. If {{FAILED}} is a valid outcome of a job execution, then the return code should be zero. One could then argue that users should configure their restart strategies to always restart if they don't want to reach a {{FAILED}} state.

If one considers the per-job deployment mode the vehicle to run a Flink job, then I can also see that a {{FAILED}} job state can be considered a Flink failure and, hence, the process should terminate with a non-zero exit code.

Somewhat related to this question is what causes the job to fail. If it is a user code fault, then I would be more inclined to say that {{FAILED}} is a valid terminal state with a zero exit code, because Flink cannot do anything about it. If, on the other hand, Flink causes the job to reach a {{FAILED}} state (e.g. it does not manage to acquire enough resources, the timeouts are too tight to run successfully on the infrastructure, a fatal error occurs, etc.), then I can see that Flink should terminate with a non-zero exit code indicating that it failed.

> Terminate jobmanager process with zero exit code to avoid unexpected restarting by K8s
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-18828
>                 URL: https://issues.apache.org/jira/browse/FLINK-18828
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.1, 1.12.0, 1.11.1
>            Reporter: Yang Wang
>            Priority: Major
>             Fix For: 1.12.0, 1.11.2, 1.10.3
>
>
> Currently, the Flink jobmanager process terminates with a non-zero exit code if
> the job reaches {{ApplicationStatus.FAILED}}. This is not ideal in a K8s
> deployment, since a non-zero exit code will cause unexpected restarting. Also,
> from a framework's perspective, a FAILED job does not mean that Flink has
> failed and, hence, the return code could still be 0.
>
> Note:
> This is a special case for standalone K8s deployment. For
> standalone/Yarn/Mesos/native K8s, terminating with a non-zero exit code is
> harmless. And a non-zero exit code could help to check the job result quickly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
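
To make the cause-based distinction in the comment above concrete, here is a minimal sketch of such an exit-code policy. It is not Flink's implementation: {{ExitCodePolicy}}, {{TerminalState}}, {{FailureOrigin}} and the non-zero value {{1}} are hypothetical names and values chosen only for illustration, and how a failure would actually be attributed to user code or to the framework is left open.

```java
// Hypothetical sketch, not part of Flink: maps a job's terminal state and the
// origin of a failure to a process exit code, following the reasoning that
// only failures caused by the framework should produce a non-zero exit code.
final class ExitCodePolicy {

    // Assumed terminal states, for illustration only.
    enum TerminalState { FINISHED, CANCELED, FAILED }

    // Assumed classification of what caused the job to fail.
    enum FailureOrigin { NONE, USER_CODE, FRAMEWORK }

    // Illustrative non-zero value; the real codes Flink uses are not assumed here.
    static final int FRAMEWORK_FAILURE_EXIT_CODE = 1;

    private ExitCodePolicy() {}

    static int exitCodeFor(TerminalState state, FailureOrigin origin) {
        if (state != TerminalState.FAILED) {
            // FINISHED and CANCELED are regular outcomes of a job execution.
            return 0;
        }
        if (origin == FailureOrigin.FRAMEWORK) {
            // E.g. not enough resources, too tight timeouts, a fatal error.
            return FRAMEWORK_FAILURE_EXIT_CODE;
        }
        // User code fault: FAILED is treated as a valid terminal state.
        return 0;
    }
}
```

Under this sketch, a job that fails because of user code exits with 0, so a restart-on-failure policy in K8s would not restart the process, while a framework-caused failure still exits with a non-zero code.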