[ 
https://issues.apache.org/jira/browse/FLINK-24063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407984#comment-17407984
 ] 

Till Rohrmann edited comment on FLINK-24063 at 9/1/21, 9:02 AM:
----------------------------------------------------------------

The job will end up in the {{FAILED}} state in step 4 because of 
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/ClusterEntrypoint.java#L203.
 I think that's what [~aitozi] would like to change.

What I would like to change is not the {{ApplicationStatus}} but the 
{{ShutdownBehaviour}}: it should become {{ShutdownBehaviour.STOP_PROCESS}} if 
the {{shutDownFuture}} completes exceptionally in the {{runCluster}} method 
(assuming that this future completes exceptionally if and only if an unexpected 
exception occurs) [~wangyang0918]. That way, we won't unregister the 
application but simply restart the process so that another instance can try its 
luck.
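
To make the proposal concrete, here is a minimal, self-contained sketch of the decision. It is not the actual {{ClusterEntrypoint}} code: the enum is a local stand-in for the {{ShutdownBehaviour}} values named in this issue, and {{chooseBehaviour}} is a hypothetical helper introduced only for illustration. It also relies on the assumption stated above, that the future completes exceptionally only for unexpected errors.

```java
import java.util.concurrent.CompletableFuture;

public class ShutdownBehaviourSketch {

    // Local stand-in for the enum discussed in this issue; not Flink's actual class.
    enum ShutdownBehaviour {
        STOP_APPLICATION, // unregister the application from the resource framework
        STOP_PROCESS      // only stop this process so that a new instance can be started
    }

    // Hypothetical helper: maps the outcome of the shut-down future to a behaviour.
    // Assumption: an exceptional completion always means an unexpected error.
    static <T> CompletableFuture<ShutdownBehaviour> chooseBehaviour(
            CompletableFuture<T> shutDownFuture) {
        return shutDownFuture.handle(
                (result, throwable) ->
                        throwable != null
                                ? ShutdownBehaviour.STOP_PROCESS
                                : ShutdownBehaviour.STOP_APPLICATION);
    }

    public static void main(String[] args) {
        // Regular termination: the application reached a final status.
        CompletableFuture<String> regular = CompletableFuture.completedFuture("SUCCEEDED");

        // Unexpected failure, e.g. the cluster could not start.
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new RuntimeException("unexpected startup failure"));

        System.out.println(chooseBehaviour(regular).join());
        System.out.println(chooseBehaviour(failed).join());
    }
}
```

With this split, a regular termination still unregisters the application, whereas an unexpected failure only stops the process and leaves it to YARN or Kubernetes to start another instance.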


was (Author: till.rohrmann):
The job will end up in the {{FAILED}} state in step 4 because of 
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/ClusterEntrypoint.java#L203.
 I think that's what [~aitozi] would like to change.

What I would like to change is not the {{ApplicationStatus}} but the 
{{ShutdownBehaviour}} to {{ShutdownBehaviour.STOP_PROCESS}} if the 
{{shutDownFuture}} completes exceptionally (assuming that this future will only 
complete exceptionally iff an unexpected exception occurs) [~wangyang0918]. 
That way, we won't unregister the application but simply restart the process so 
that another instance can try its luck.

> Reconsider the behavior of ClusterEntrypoint#startCluster failure handler
> -------------------------------------------------------------------------
>
>                 Key: FLINK-24063
>                 URL: https://issues.apache.org/jira/browse/FLINK-24063
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Aitozi
>            Priority: Minor
>
> If runCluster fails for a job, it triggers the STOP_APPLICATION behavior. 
> But consider a case like this:
>  # A job has been running for a long time.
>  # The JobManager then encounters a fatal error, such as a network problem, 
> which may bring the JobManager process down.
>  # A new process is started by the resource framework, such as YARN or 
> Kubernetes, but it fails in ClusterEntrypoint#startCluster due to 
> the same network problem.
>  # The job then transitions to the FAILED status.
>  
> This means a streaming job will stop running for good because of a fatal 
> error, which is fragile. I think we should add a retry mechanism so that the 
> job does not fail fast a second time and can cope with external errors that 
> persist for a period of time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
