[ 
https://issues.apache.org/jira/browse/FLINK-28411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562909#comment-17562909
 ] 

Jiangjie Qin commented on FLINK-28411:
--------------------------------------

[~martijnvisser] Yes, I think this is a problem. I need to look further into 
the JM initialization logic to see what the best fix is. As Stephan mentioned 
in the other ticket, moving the initialization into the enumerator thread 
could be an option. That defers the exception handling until after the JM 
initialization finishes, at which point the JM is able to handle the per-job 
global failure correctly.
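
A minimal sketch of that idea, assuming nothing about Flink's actual 
classes: DeferredStartCoordinator and JobFailureHandler are hypothetical 
names made up for illustration, and the single-thread executor stands in for 
the enumerator thread. The point is only that start() returns without 
throwing, and the initialization error is later reported through a per-job 
failure callback instead of propagating up the caller's stack.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class DeferredCoordinatorStart {

    /** Hypothetical stand-in for a per-job failure callback. */
    interface JobFailureHandler {
        void failJob(Throwable cause);
    }

    /** Hypothetical coordinator whose start() never throws on the caller's thread. */
    static final class DeferredStartCoordinator {
        // Stand-in for the enumerator thread.
        private final ExecutorService coordinatorThread =
                Executors.newSingleThreadExecutor();
        private final Runnable initialization;
        private final JobFailureHandler failureHandler;

        DeferredStartCoordinator(Runnable initialization, JobFailureHandler failureHandler) {
            this.initialization = initialization;
            this.failureHandler = failureHandler;
        }

        /** Schedules initialization; any failure is reported as a job failure. */
        void start() {
            coordinatorThread.execute(() -> {
                try {
                    initialization.run();
                } catch (Throwable t) {
                    // Reported per job, not rethrown to whoever called start().
                    failureHandler.failJob(t);
                }
            });
        }

        /** Waits for the scheduled initialization to finish, then stops the thread. */
        void close() {
            coordinatorThread.shutdown();
            try {
                coordinatorThread.awaitTermination(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Runs a failing initialization and returns the message that was reported. */
    static String runDemo() {
        AtomicReference<Throwable> reported = new AtomicReference<>();
        DeferredStartCoordinator coordinator = new DeferredStartCoordinator(
                () -> { throw new IllegalStateException("enumerator init failed"); },
                reported::set);

        coordinator.start();  // returns normally even though initialization will fail
        coordinator.close();  // wait until the failure has been reported

        Throwable cause = reported.get();
        return cause == null ? null : cause.getMessage();
    }

    public static void main(String[] args) {
        System.out.println("Job failed with: " + runDemo());
    }
}
```

Whether this is the right fix still depends on the JM initialization logic 
mentioned above; the sketch only shows the shape of the deferral.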

> OperatorCoordinator exception may fail Session Cluster
> ------------------------------------------------------
>
>                 Key: FLINK-28411
>                 URL: https://issues.apache.org/jira/browse/FLINK-28411
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Common
>            Reporter: Daren Wong
>            Priority: Major
>             Fix For: 1.15.2
>
>
> Part of the Scheduler's startScheduling procedure involves starting every 
> OperatorCoordinatorHolder. When one of the OperatorCoordinators fails to 
> start, the exception is forwarded up the stack, triggering a JobMaster 
> failover. However, JobMaster failover only works if HA is enabled[1]. If HA 
> is not enabled, the fatal error handler simply exits the JM process, 
> killing the entire cluster. This is problematic in a session cluster, 
> where multiple jobs may be running. It also does not play well with 
> external tooling that does not expect a job failure to cause a full 
> cluster failure. 
>  
> It would be preferable if failure to start an OperatorCoordinator did not 
> take down the entire cluster, but instead failed that particular job. 
>  
> This issue is similar to https://issues.apache.org/jira/browse/FLINK-24303, 
> which fixed this issue for the SourceCoordinator specifically.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
