Daren Wong created FLINK-28411:
----------------------------------
Summary: OperatorCoordinator exception may fail Session Cluster
Key: FLINK-28411
URL: https://issues.apache.org/jira/browse/FLINK-28411
Project: Flink
Issue Type: Bug
Components: Connectors / Common
Reporter: Daren Wong
Fix For: 1.15.2
Part of the Scheduler's startScheduling procedure involves starting every
OperatorCoordinatorHolder. When one of the OperatorCoordinators fails to
start, the exception is forwarded up the stack, triggering a JobMaster failover.
However, JobMaster failover only works if HA is enabled[1]. If HA is not
enabled, the fatal error handler simply exits the JM process, killing the
entire cluster. This is problematic in a session cluster, where multiple
jobs may be running. It also does not play well with external tooling
that does not expect a job failure to cause a full cluster failure.
It would be preferable if failure to start an OperatorCoordinator did not take
down the entire cluster, but instead failed that particular job.
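The difference between the two behaviors can be sketched roughly as follows. This is only an illustration under stated assumptions: Coordinator, SessionCluster, JobState and both startJob methods are hypothetical stand-ins, not Flink's actual classes or the proposed fix.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CoordinatorStartSketch {

    /** Hypothetical stand-in for an operator coordinator; not Flink's interface. */
    public interface Coordinator {
        void start() throws Exception;
    }

    public enum JobState { RUNNING, FAILED }

    /** Hypothetical session cluster tracking one state per submitted job. */
    public static final class SessionCluster {
        private final Map<String, JobState> jobs = new LinkedHashMap<>();
        private boolean alive = true;

        /** Reported behavior without HA: a start failure escalates to a fatal error. */
        public void startJobEscalating(String jobId, List<Coordinator> coordinators) {
            try {
                startAll(coordinators);
                jobs.put(jobId, JobState.RUNNING);
            } catch (Exception e) {
                // Models the fatal error handler exiting the JM process:
                // every job in the session goes down with it.
                alive = false;
                jobs.put(jobId, JobState.FAILED);
                jobs.replaceAll((id, state) -> JobState.FAILED);
            }
        }

        /** Preferred behavior: the failure is contained to the job being started. */
        public void startJobIsolated(String jobId, List<Coordinator> coordinators) {
            try {
                startAll(coordinators);
                jobs.put(jobId, JobState.RUNNING);
            } catch (Exception e) {
                // Only this job is marked as failed; the cluster keeps running.
                jobs.put(jobId, JobState.FAILED);
            }
        }

        private static void startAll(List<Coordinator> coordinators) throws Exception {
            for (Coordinator coordinator : coordinators) {
                coordinator.start();
            }
        }

        public boolean isAlive() { return alive; }

        public JobState stateOf(String jobId) { return jobs.get(jobId); }
    }

    public static void main(String[] args) {
        Coordinator healthy = () -> { };
        Coordinator broken = () -> {
            throw new IllegalStateException("coordinator failed to start");
        };

        SessionCluster cluster = new SessionCluster();
        cluster.startJobIsolated("job-a", Arrays.asList(healthy));
        cluster.startJobIsolated("job-b", Arrays.asList(healthy, broken));
        System.out.println("cluster alive: " + cluster.isAlive());
        System.out.println("job-a: " + cluster.stateOf("job-a"));
        System.out.println("job-b: " + cluster.stateOf("job-b"));
    }
}
```

In the isolated variant, the job whose coordinator fails to start ends up FAILED while the other job and the cluster itself are unaffected. In the escalating variant, the same failure takes every job down.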
This issue is similar to https://issues.apache.org/jira/browse/FLINK-24303,
which fixed the same problem for the SourceCoordinator specifically.