[jira] [Commented] (FLINK-24303) SourceCoordinator exception may fail Session Cluster

Stephan Ewen (Jira) Mon, 20 Sep 2021 04:49:04 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-24303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17417603#comment-17417603
 ]


Stephan Ewen commented on FLINK-24303:
--------------------------------------

In the longer term, I would like to change the Source Coordinator code such 
that all enumerator creation and restore actually happens in the enumerator 
thread. That makes the failure handling cleaner (no exceptions can happen 
during the startup phase) and also moves expensive enumerator initialization 
out of the JobManager startup phase.
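
A minimal sketch of that idea, assuming hypothetical names throughout 
(`Coordinator`, `Enumerator`, `failJob` are illustrative stand-ins, not Flink's 
actual SourceCoordinator API): the enumerator is created inside a dedicated 
single-threaded executor, so `start()` itself cannot throw, and any creation 
failure is routed to a per-job failure callback instead of propagating up the 
caller's stack.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Illustrative sketch only; all types here are hypothetical, not Flink classes. */
public class EnumeratorStartupSketch {

    /** Stand-in for a split enumerator. */
    interface Enumerator {
        void start();
    }

    /** Stand-in for a coordinator that defers enumerator creation to its own thread. */
    static final class Coordinator {
        private final ExecutorService enumeratorThread = Executors.newSingleThreadExecutor();
        private final Supplier<Enumerator> enumeratorFactory;
        private final Consumer<Throwable> failJob;

        Coordinator(Supplier<Enumerator> enumeratorFactory, Consumer<Throwable> failJob) {
            this.enumeratorFactory = enumeratorFactory;
            this.failJob = failJob;
        }

        /** Returns immediately; creation errors surface only through failJob. */
        void start() {
            enumeratorThread.execute(() -> {
                try {
                    // Potentially expensive and failure-prone work runs off the caller's thread.
                    Enumerator enumerator = enumeratorFactory.get();
                    enumerator.start();
                } catch (Throwable t) {
                    // Fail only this job; nothing is rethrown to the starting thread.
                    failJob.accept(t);
                }
            });
        }

        /** Stops accepting work and waits for the pending creation task to finish. */
        void close() throws InterruptedException {
            enumeratorThread.shutdown();
            enumeratorThread.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    /** Starts a coordinator whose factory always fails; returns the failure reported to the job. */
    static Throwable runWithFailingFactory() throws InterruptedException {
        AtomicReference<Throwable> jobFailure = new AtomicReference<>();
        Coordinator coordinator = new Coordinator(
                () -> { throw new IllegalStateException("cannot create enumerator"); },
                jobFailure::set);
        try {
            coordinator.start(); // does not throw, even though creation will fail
        } finally {
            coordinator.close();
        }
        return jobFailure.get();
    }

    public static void main(String[] args) throws Exception {
        Throwable failure = runWithFailingFactory();
        System.out.println("job failed with: " + failure.getMessage());
    }
}
```

In this sketch the starting thread never sees the exception, which mirrors the 
"no exceptions during the startup phase" property described above; how the 
real coordinator would report the failure to the job is left open here.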

> SourceCoordinator exception may fail Session Cluster
> ----------------------------------------------------
>
>                 Key: FLINK-24303
>                 URL: https://issues.apache.org/jira/browse/FLINK-24303
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Common
>            Reporter: Seth Wiesman
>            Assignee: Stephan Ewen
>            Priority: Blocker
>             Fix For: 1.14.0
>
>
> The SourceCoordinator currently forwards all exceptions from 
> `Source#createEnumerator` up the stack, triggering a JobMaster failover. 
> However, JobMaster failover only works if HA is enabled[1]. If HA is not 
> enabled, the fatal error handler simply exits the JM process, killing the 
> entire cluster. This is problematic in the case of a session cluster, where 
> multiple jobs may be running. It also does not play well with external 
> tooling that does not expect a job failure to cause a full cluster failure. 
>  
> It would be preferable if failure to create an enumerator did not take down 
> the entire cluster, but instead failed that particular job. 
>  
> [1] 
> [https://github.com/apache/flink/blob/7f69331294ab2ab73f77b40a4320cdda53246afe/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L898-L903]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
