Aitozi created FLINK-24063:
------------------------------

             Summary: Reconsider the behavior of ClusterEntrypoint#startCluster failure handler
                 Key: FLINK-24063
                 URL: https://issues.apache.org/jira/browse/FLINK-24063
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
            Reporter: Aitozi

If runCluster fails, it triggers the STOP_APPLICATION behavior. But consider a case like this:
# The JobManager encounters a fatal error, such as a network problem, which may bring the JobManager process down.
# The resource framework (e.g. YARN or Kubernetes) then starts a new process, but it fails in ClusterEntrypoint#startCluster due to the same network problem.
# The job then turns into the FAILED status.

This means a streaming job stops running for good because of a single fatal error, which is fragile. I think we should add a retry mechanism so that the job does not fail fast twice in a row and can ride out external errors that last for a period of time.
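
To make the proposal concrete, here is a minimal sketch of what such a retry could look like. It is only an illustration under assumptions: StartupRetry, runWithRetry, maxAttempts and initialBackoffMillis are hypothetical names that do not exist in Flink, and the sketch does not show how ClusterEntrypoint actually handles failures today.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper, not part of Flink: retries a startup action with exponential backoff. */
final class StartupRetry {

    private StartupRetry() {}

    /**
     * Calls the action until it succeeds or maxAttempts is reached.
     * The wait between attempts starts at initialBackoffMillis and doubles each time.
     * If every attempt fails, the exception from the last attempt is rethrown.
     */
    static <T> T runWithRetry(Callable<T> action, int maxAttempts, long initialBackoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long backoff = initialBackoffMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry on interruption; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Give a transient external problem (e.g. the network) time to recover.
                    Thread.sleep(backoff);
                    backoff *= 2;
                }
            }
        }
        throw last;
    }
}
```

Under this assumption, the startup logic in ClusterEntrypoint#startCluster would be wrapped in such a call, and the existing failure handling (ending in STOP_APPLICATION) would run only after the last attempt has failed. The number of attempts and the backoff would presumably need to be configurable.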