[jira] [Commented] (FLINK-18828) Terminate jobmanager process with zero exit code to avoid unexpected restarting by K8s

Yang Wang (Jira) Fri, 28 Aug 2020 01:16:39 -0700


    [ https://issues.apache.org/jira/browse/FLINK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186371#comment-17186371 ]


Yang Wang commented on FLINK-18828:
-----------------------------------

Thanks for your attention to this ticket. You can find more discussion 
here [1].

Actually, setting the {{restartPolicy}} to {{Never}} could help with the 
problem, but it does not solve it completely. The ideal behavior is this: 
if the user-configured Flink restart strategy (e.g. fixed-delay with 3 attempts) 
is not exhausted, we should restart the JobManager pod. And vice versa, once it 
is exhausted, we should not restart the JobManager pod. 
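
The ideal behavior above can be sketched as a small decision function. This is 
only an illustration: {{should_restart_jobmanager_pod}}, {{restarts_used}} and 
{{max_attempts}} are hypothetical names, not Flink or Kubernetes APIs, and the 
fixed-delay strategy is reduced to a plain attempt counter.

```python
def should_restart_jobmanager_pod(restarts_used: int, max_attempts: int) -> bool:
    """Decide whether the JobManager pod should come back after a failure.

    Hypothetical model: the restart strategy allows `max_attempts` restarts
    (e.g. fixed-delay with 3 attempts). While restarts remain, the pod should
    be restarted; once they are used up, it should stay down.
    """
    if restarts_used < 0 or max_attempts < 0:
        raise ValueError("attempt counts must be non-negative")
    return restarts_used < max_attempts


if __name__ == "__main__":
    for used in range(5):
        print(used, should_restart_jobmanager_pod(used, max_attempts=3))
```

Kubernetes has no knowledge of this counter, which is why a static 
{{restartPolicy}} alone cannot express the rule.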

Since we cannot specify the {{restartPolicy}} based on the exit code [2], I 
suggest exiting the JobManager process with a zero exit code. Another reason is 
that a failed Flink job does not mean we have to terminate the cluster with a 
non-zero code. It is an expected termination.
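
To illustrate why a zero exit code avoids the restart, here is a rough model of 
the pod-level container restart rules ({{Always}}, {{OnFailure}}, {{Never}}) 
together with the proposed exit code mapping. The function names, the 
{{standalone_k8s}} flag and the non-zero placeholder value {{1}} are assumptions 
made for this sketch; they are not Flink's actual exit codes or APIs.

```python
def container_is_restarted(restart_policy: str, exit_code: int) -> bool:
    """Simplified model of the pod-level container restart decision."""
    if restart_policy == "Always":
        return True
    if restart_policy == "OnFailure":
        # Only a non-zero exit code counts as a failure.
        return exit_code != 0
    if restart_policy == "Never":
        return False
    raise ValueError(f"unknown restartPolicy: {restart_policy}")


def jobmanager_exit_code(job_status: str, standalone_k8s: bool) -> int:
    """Hypothetical mapping from the final job status to a process exit code.

    Proposal from this ticket: on standalone K8s, a FAILED job is an expected
    termination and exits with 0. The value 1 is a placeholder for "some
    non-zero code", not the real Flink value.
    """
    if job_status == "FAILED" and not standalone_k8s:
        return 1
    return 0


if __name__ == "__main__":
    code = jobmanager_exit_code("FAILED", standalone_k8s=True)
    print(code, container_is_restarted("OnFailure", code))
```

With {{OnFailure}}, a FAILED job that exits with a non-zero code is restarted, 
while the same job exiting with zero is left alone.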

 

[1] https://lists.apache.org/thread.html/r97cd9707cedf92667e4c34182bac6ce2e6c34bc2f1b64d93123ff16a%40%3Cuser.flink.apache.org%3E

[2] https://stackoverflow.com/questions/48797297/is-it-possible-to-define-restartpolicy-based-on-container-exit-code

> Terminate jobmanager process with zero exit code to avoid unexpected 
> restarting by K8s
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-18828
>                 URL: https://issues.apache.org/jira/browse/FLINK-18828
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.1, 1.12.0, 1.11.1
>            Reporter: Yang Wang
>            Priority: Major
>             Fix For: 1.12.0, 1.11.2, 1.10.3
>
>
> Currently, the Flink JobManager process terminates with a non-zero exit code if 
> the job reaches {{ApplicationStatus.FAILED}}. This is not ideal in a K8s 
> deployment, since a non-zero exit code will cause an unexpected restart. Also, 
> from a framework's perspective, a FAILED job does not mean that Flink has 
> failed and, hence, the return code could still be 0.
>
> Note:
> This is a special case for standalone K8s deployment. For 
> standalone/Yarn/Mesos/native K8s, terminating with a non-zero exit code is 
> harmless, and a non-zero exit code could help to check the job result quickly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
