[jira] [Commented] (FLINK-24063) Reconsider the behavior of ClusterEntrypoint#startCluster failure handler

Yang Wang (Jira) Tue, 31 Aug 2021 23:50:08 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-24063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407880#comment-17407880
 ]


Yang Wang commented on FLINK-24063:
-----------------------------------

[~aitozi] IIUC, you mean the {{STOP_APPLICATION}} in the {{runCluster}}, not 
the {{startCluster}}. Right?

Actually, I do not fully understand steps 3 and 4. If the JobManager has some 
network issues, it might hit a fatal error and get restarted on another 
machine. This is the expected behavior. The JobManager should then work well 
and recover from the latest successful checkpoint. Do you mean that the 
underlying resource framework is not aware of such a network issue and keeps 
scheduling to the same node?

In step 4, why does the job go into the {{FAILED}} status? AFAIK, a JobManager 
restart should not affect the job status.

 

[~trohrmann], I agree with you that perhaps not every exception with which 
{{clusterComponent#shutDownFuture}} completes should trigger 
{{STOP_APPLICATION}}. I have tried to introduce such behavior in this PR[1]. 
In which case do you think we should not stop the application even though the 
future completes with an exception?

[1]. 
https://github.com/apache/flink/pull/16121/files#diff-74b961fb51624f7a964de7e538c545fce7b2cf02cdc080aaa779d009aa51cb80R270
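
To make the question concrete, here is a minimal sketch of what "not every 
exception triggers {{STOP_APPLICATION}}" could look like. It is not the code 
from the PR and not Flink API: the {{Action}} enum, 
{{RecoverableClusterException}}, and the rule that a normal completion stops 
the application are all assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class ShutdownFutureSketch {

    /** Hypothetical outcomes; only the name STOP_APPLICATION comes from the discussion. */
    enum Action { STOP_APPLICATION, EXIT_PROCESS_ONLY }

    /** Hypothetical marker for failures that should not stop the whole application. */
    static class RecoverableClusterException extends RuntimeException {
        RecoverableClusterException(String message) {
            super(message);
        }
    }

    /** Unwraps CompletionException layers and picks an action for the root failure. */
    static Action classify(Throwable failure) {
        Throwable cause = failure;
        while (cause instanceof CompletionException && cause.getCause() != null) {
            cause = cause.getCause();
        }
        return cause instanceof RecoverableClusterException
                ? Action.EXIT_PROCESS_ONLY
                : Action.STOP_APPLICATION;
    }

    /** Maps the shut-down future to an action (assumption: normal completion stops the app). */
    static CompletableFuture<Action> decide(CompletableFuture<?> shutDownFuture) {
        return shutDownFuture.handle(
                (ignored, failure) ->
                        failure == null ? Action.STOP_APPLICATION : classify(failure));
    }

    public static void main(String[] args) {
        CompletableFuture<Void> future = new CompletableFuture<>();
        future.completeExceptionally(new RecoverableClusterException("network unreachable"));
        System.out.println(decide(future).join()); // prints EXIT_PROCESS_ONLY
    }
}
```

With this shape, the open question reduces to which exception types, if any, 
belong in the "recoverable" bucket.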

> Reconsider the behavior of ClusterEntrypoint#startCluster failure handler
> -------------------------------------------------------------------------
>
>                 Key: FLINK-24063
>                 URL: https://issues.apache.org/jira/browse/FLINK-24063
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Aitozi
>            Priority: Minor
>
> If runCluster fails for a job, it triggers the STOP_APPLICATION behavior. 
> But consider a case like this:
>  # A job has been running for a long time.
>  # Then the JobManager encounters a fatal error, such as a network problem, 
> which may bring the JobManager process down.
>  # Then a new process is started by the resource framework, such as YARN or 
> Kubernetes. But it fails at ClusterEntrypoint#startCluster due to 
> the same network problem.
>  # Then the job turns into the FAILED status.
>  
> This means a streaming job will no longer run after some fatal error, which 
> is somewhat fragile. I think we should add a retry mechanism to prevent 
> the job from failing fast twice, so that it can tolerate external errors 
> that may persist for a period of time.
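
The retry mechanism proposed in the description above could be sketched 
roughly as follows. This is a generic retry-with-backoff helper, not an 
existing Flink API: the class name, the method name, and the idea of wrapping 
the startup step in it are hypothetical, and the attempt count and delay are 
placeholder values.

```java
import java.util.function.Supplier;

public class StartupRetrySketch {

    /** Runs the action up to maxAttempts times, doubling the delay after each failure. */
    static <T> T retryWithBackoff(Supplier<T> action, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(delay);
                    delay *= 2;
                }
            }
        }
        // Only reached after every attempt has failed; rethrow the final failure.
        throw last;
    }

    /** Sleeps, converting interruption into an unchecked exception and restoring the flag. */
    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in for a startup step that fails twice because of a transient network problem.
        String result = retryWithBackoff(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("network not ready");
            }
            return "started";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints "started after 3 attempts"
    }
}
```

Whether such a retry belongs inside the entrypoint or should be left to the 
resource framework's own restart policy is exactly the point under discussion.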



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
