[
https://issues.apache.org/jira/browse/FLINK-24063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407082#comment-17407082
]
Aitozi edited comment on FLINK-24063 at 8/31/21, 5:40 AM:
----------------------------------------------------------
Looking forward to your opinion on this [[email protected]]
was (Author: aitozi):
Looking forward to your opinion on this [~trohrmann]
> Reconsider the behavior of ClusterEntrypoint#startCluster failure handler
> -------------------------------------------------------------------------
>
> Key: FLINK-24063
> URL: https://issues.apache.org/jira/browse/FLINK-24063
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Reporter: Aitozi
> Priority: Minor
>
> If runCluster fails for a job, it triggers the STOP_APPLICATION behavior.
> But consider a case like this:
> # A job has been running for a long time.
> # The JobManager then encounters a fatal error, such as a network problem,
> which may bring the JobManager process down.
> # A new process is started by the resource framework, such as YARN or
> Kubernetes, but it fails in ClusterEntrypoint#startCluster due to
> the same network problem.
> # The job then turns into the FAILED status.
>
> This means a streaming job will stop running for good because of a single
> fatal error, which is fragile. I think we should add a retry mechanism so
> that the job does not fail fast twice in a row and can ride out an external
> error that may persist for a period of time.
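
The retry idea proposed above could look roughly like the following. This is only a minimal sketch, not Flink's actual implementation: `StartupRetry`, `runWithRetry`, and the parameters `maxAttempts` and `delay` are hypothetical names introduced for illustration, and the startup step is modeled as a plain `Callable` rather than the real ClusterEntrypoint#startCluster.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper illustrating a bounded retry around a startup step. */
public class StartupRetry {

    /**
     * Runs startAction up to maxAttempts times, sleeping for a fixed delay
     * between failed attempts. Rethrows the last failure once the attempts
     * are exhausted, so the caller can still apply its failure handler.
     */
    public static <T> T runWithRetry(Callable<T> startAction, int maxAttempts, Duration delay)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return startAction.call();
            } catch (InterruptedException e) {
                // Do not retry on interruption; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    // Wait before the next attempt so a transient external
                    // problem (e.g. a network outage) has time to clear.
                    Thread.sleep(delay.toMillis());
                }
            }
        }
        throw lastFailure;
    }
}
```

With such a helper, the entrypoint would only fall through to the STOP_APPLICATION handling after every attempt has failed. Whether a fixed delay, an exponential backoff, or a total time budget fits best is a design choice left open here.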
--
This message was sent by Atlassian Jira
(v8.3.4#803005)