[
https://issues.apache.org/jira/browse/FLINK-20382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243197#comment-17243197
]
Stephan Ewen commented on FLINK-20382:
--------------------------------------
For the suggestion from [~rmetzger], a few questions:
As far as I can tell, the "startScheduling()" method of the scheduler is called
only once. So if the exception there is caught and triggers a global failure,
then we will never try to start the coordinator again. So that sounds like it
is not a feasible solution.
[~trohrmann] Could you confirm that?
I would say, let's keep the "fatal error in start()" behavior for now, because
the sources should (in the future) no longer do anything in "start()" that
can fail.
> Exception thrown from JobMaster.startScheduling() may be ignored.
> -----------------------------------------------------------------
>
> Key: FLINK-20382
> URL: https://issues.apache.org/jira/browse/FLINK-20382
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.11.2
> Reporter: Jiangjie Qin
> Assignee: Till Rohrmann
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.12.0, 1.11.3
>
>
> Currently {{JobMaster.resetAndStartScheduler()}} invokes
> {{startScheduling()}} in a {{thenRun}} clause without {{exceptionally}} or
> {{handle}} to handle exceptions. The job may hang if an exception is thrown
> when starting scheduling, e.g. on a failure to create operator coordinators.
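
To illustrate the issue description above, here is a minimal, standalone sketch of how an exception thrown inside a thenRun stage stays hidden unless a handle (or exceptionally) stage is attached. It is not the actual JobMaster code: the class ThenRunSketch, the always-failing startScheduling() stand-in, and the "reported" reference are hypothetical names used only for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.atomic.AtomicReference;

public class ThenRunSketch {

    // Hypothetical stand-in for the scheduler start; it always fails here.
    static void startScheduling() {
        throw new IllegalStateException("failed to create operator coordinators");
    }

    // No handler attached: the exception is captured in the returned future,
    // and is lost if nobody ever inspects that future.
    static CompletableFuture<Void> withoutHandler() {
        return CompletableFuture.completedFuture(null)
                .thenRun(ThenRunSketch::startScheduling);
    }

    // With handle(): the failure is observed and can be escalated. In this
    // sketch it is only recorded in "reported" (an assumption for illustration).
    static CompletableFuture<Void> withHandler(AtomicReference<Throwable> reported) {
        return CompletableFuture.completedFuture(null)
                .thenRun(ThenRunSketch::startScheduling)
                .<Void>handle((ignored, failure) -> {
                    if (failure != null) {
                        reported.set(unwrap(failure));
                    }
                    return null;
                });
    }

    // Failures from an earlier stage usually arrive wrapped in a CompletionException.
    static Throwable unwrap(Throwable t) {
        return (t instanceof CompletionException && t.getCause() != null) ? t.getCause() : t;
    }

    public static void main(String[] args) {
        CompletableFuture<Void> silent = withoutHandler();
        // Nothing is thrown or logged here; the failure is only visible on the future.
        System.out.println("failed silently: " + silent.isCompletedExceptionally());

        AtomicReference<Throwable> reported = new AtomicReference<>();
        withHandler(reported);
        System.out.println("reported: " + reported.get());
    }
}
```

What the handler should do with the failure (fail fatally, as suggested in the comment above, or trigger a global failover) is a separate design decision that this sketch does not settle.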