[
https://issues.apache.org/jira/browse/FLINK-24063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407880#comment-17407880
]
Yang Wang commented on FLINK-24063:
-----------------------------------
[~aitozi] IIUC, you mean the {{STOP_APPLICATION}} triggered in {{runCluster}},
not in {{startCluster}}, right?
Actually, I do not fully understand steps 3 and 4. If the JobManager has some
network issues, it might hit a fatal error and get restarted on another
machine. This is the expected behavior. The new JobManager should then work
well and recover from the latest successful checkpoint. Do you mean the
underlying resource framework is not aware of the network issue and keeps
scheduling the JobManager onto the same node?
In step 4, why does the job go into the {{FAILED}} status? AFAIK, a JobManager
restart should not affect the job status.
[~trohrmann], I agree with you that maybe not every exception with which
{{clusterComponent#shutDownFuture}} completes should trigger
{{STOP_APPLICATION}}. I have tried to introduce such behavior in this PR[1]. I
am wondering in which cases you think we should not stop the application even
though the future completes exceptionally.
[1].
https://github.com/apache/flink/pull/16121/files#diff-74b961fb51624f7a964de7e538c545fce7b2cf02cdc080aaa779d009aa51cb80R270
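To make the question about {{clusterComponent#shutDownFuture}} concrete, here
is a minimal sketch of the idea in plain Java. It is not the code from the PR
and it does not use Flink classes: the {{Behaviour}} enum is a local stand-in,
and treating {{IOException}} as a transient failure is purely a placeholder
assumption, since which exceptions should avoid {{STOP_APPLICATION}} is exactly
the open question.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

final class ShutdownDecisionSketch {

    /** Local stand-in for the two outcomes discussed in this thread (not a Flink type). */
    enum Behaviour { STOP_APPLICATION, STOP_PROCESS }

    /** Hypothetical rule: I/O problems are treated as transient, everything else as terminal. */
    static Behaviour classify(Throwable failure) {
        Throwable cause = failure;
        // Dependent stages of a CompletableFuture wrap failures in CompletionException.
        while (cause instanceof CompletionException && cause.getCause() != null) {
            cause = cause.getCause();
        }
        return (cause instanceof IOException)
                ? Behaviour.STOP_PROCESS
                : Behaviour.STOP_APPLICATION;
    }

    /** Maps the outcome of a shutdown future to a behaviour. */
    static CompletableFuture<Behaviour> decide(CompletableFuture<?> shutDownFuture) {
        return shutDownFuture.handle(
                (ignored, failure) ->
                        // Assumption: a normal completion still stops the application.
                        failure == null ? Behaviour.STOP_APPLICATION : classify(failure));
    }

    public static void main(String[] args) {
        CompletableFuture<Void> networkFailure = new CompletableFuture<>();
        networkFailure.completeExceptionally(new IOException("connection reset"));
        System.out.println(decide(networkFailure).join());

        CompletableFuture<Void> unexpected = new CompletableFuture<>();
        unexpected.completeExceptionally(new IllegalStateException("unexpected"));
        System.out.println(decide(unexpected).join());
    }
}
```

With such a split, only the failures classified as terminal would deregister
the application, while the others would just end the process and leave the
restart to the resource framework.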
> Reconsider the behavior of ClusterEntrypoint#startCluster failure handler
> -------------------------------------------------------------------------
>
> Key: FLINK-24063
> URL: https://issues.apache.org/jira/browse/FLINK-24063
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Reporter: Aitozi
> Priority: Minor
>
> If runCluster fails, it triggers the STOP_APPLICATION behavior.
> But consider a case like this:
> # A job has been running for a long time
> # Then the JobManager encounters a fatal error, such as a network problem,
> which may bring the JobManager process down
> # Then a new process is started by the resource framework, such as YARN or
> Kubernetes. But it fails in ClusterEntrypoint#startCluster due to
> the same network problem.
> # Then the job turns into the FAILED status.
>
> This means a streaming job will no longer run after some fatal error, which
> is somewhat fragile. I think we should add a retry mechanism to prevent
> the job from failing fast twice, so that it can deal with external errors
> that may persist for a period of time.
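
The retry mechanism proposed in the issue description could look roughly like
the following sketch: a bounded number of attempts with exponential backoff
around a startup action. All names here ({{StartupRetrySketch}}, {{retry}},
{{maxAttempts}}, {{initialBackoffMillis}}) are hypothetical and nothing in it is
existing Flink API.

```java
import java.util.concurrent.Callable;

final class StartupRetrySketch {

    /**
     * Runs the action up to maxAttempts times, sleeping between failed attempts
     * and doubling the sleep each time. Rethrows the last failure if all attempts fail.
     */
    static <T> T retry(Callable<T> action, int maxAttempts, long initialBackoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry on interruption; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff);
                    backoff *= 2;
                }
            }
        }
        throw last;
    }
}
```

The bound matters: an external problem that lasts for a while is ridden out,
but a permanent one still surfaces as a failure instead of retrying forever.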
--
This message was sent by Atlassian Jira
(v8.3.4#803005)