[ https://issues.apache.org/jira/browse/FLINK-26139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498139#comment-17498139 ]
Biao Geng commented on FLINK-26139:
-----------------------------------

After checking the job state transitions in Flink, I think this operator may only need to handle 3 kinds of states for our Flink application: {{RUNNING}}, {{SUSPENDED}} and {{TERMINATED}}.

{{RUNNING}}: The application cluster is running. The actual job status could be any non-terminal state (i.e. INITIALIZING, CREATED, RUNNING, FAILING, CANCELLING, RESTARTING or RECONCILING). The JM is responsible for managing the internal state transitions, and the operator just treats the job as running.

{{TERMINATED}}: The application has finished due to the completion, failure or cancellation of the job. In this state, the operator should clean up the deployment and all other resources. No further upgrades are possible.

{{SUSPENDED}}: From the API doc: "The job has been suspended which means that it has been stopped but not been removed from a potential HA job store." Upgrades may happen in this state, such as modifying the parallelism, after which the state can transition back to {{RUNNING}}.

The state machine can be as simple as the following picture:
!image-2022-02-25-21-22-08-636.png|width=200,height=119!

I am a little uncertain whether we should split the {{TERMINATED}} state into more specific states like FAILED, FINISHED and CANCELLED so that we can track the status better on the k8s side.

Does the above design make sense? cc [~wangyang0918] [~gyfora]

> Improve JobStatus tracking and handle different job states
> ----------------------------------------------------------
>
> Key: FLINK-26139
> URL: https://issues.apache.org/jira/browse/FLINK-26139
> Project: Flink
> Issue Type: Sub-task
> Components: Kubernetes Operator
> Reporter: Gyula Fora
> Priority: Major
> Attachments: image-2022-02-25-21-22-08-636.png
>
>
> Currently we do not handle any job status changes such as cancellations,
> errors or job completions.
> We should introduce some mechanism to react and deal with these changes and
> expose them in the status as they can potentially affect upgrades.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
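The three-state mapping proposed in the comment above can be sketched in Java. This is a minimal illustration under stated assumptions, not the operator's actual code: `FlinkJobStatus` is a local stand-in that mirrors the values of Flink's `org.apache.flink.api.common.JobStatus` so the example compiles without Flink on the classpath, and `ApplicationState`, `fromJobStatus`, `requiresCleanup` and `canResume` are hypothetical names chosen for the example.

```java
import java.util.EnumSet;
import java.util.Set;

// Local stand-in mirroring the values of Flink's org.apache.flink.api.common.JobStatus,
// so this sketch compiles without Flink on the classpath (assumption, not the real type).
enum FlinkJobStatus {
    INITIALIZING, CREATED, RUNNING, FAILING, FAILED, CANCELLING, CANCELED,
    FINISHED, RESTARTING, SUSPENDED, RECONCILING
}

// Hypothetical coarse-grained state the operator tracks for an application cluster.
enum ApplicationState {
    RUNNING, SUSPENDED, TERMINATED;

    // Non-terminal job states: the JobManager drives transitions among these,
    // so the operator treats all of them as RUNNING.
    private static final Set<FlinkJobStatus> NON_TERMINAL = EnumSet.of(
            FlinkJobStatus.INITIALIZING, FlinkJobStatus.CREATED, FlinkJobStatus.RUNNING,
            FlinkJobStatus.FAILING, FlinkJobStatus.CANCELLING,
            FlinkJobStatus.RESTARTING, FlinkJobStatus.RECONCILING);

    // Maps a raw Flink job status onto one of the three operator-level states.
    static ApplicationState fromJobStatus(FlinkJobStatus status) {
        if (NON_TERMINAL.contains(status)) {
            return RUNNING;
        }
        if (status == FlinkJobStatus.SUSPENDED) {
            // Stopped, but possibly still in the HA job store: may be upgraded and resumed.
            return SUSPENDED;
        }
        // FINISHED, FAILED or CANCELED: the job will not run again.
        return TERMINATED;
    }

    // Only a terminated application needs its deployment and other resources cleaned up.
    boolean requiresCleanup() {
        return this == TERMINATED;
    }

    // A suspended application may be upgraded (e.g. new parallelism) and moved back
    // to RUNNING; a terminated one may not.
    boolean canResume() {
        return this == SUSPENDED;
    }
}
```

In this sketch the coarse state is derived from the raw Flink status on every observation. Keeping the raw status next to it (for example in the resource's status) would be one way to still show FAILED, FINISHED or CANCELED on the k8s side without adding states to the operator's own state machine.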