[
https://issues.apache.org/jira/browse/FLINK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann updated FLINK-24386:
----------------------------------
Fix Version/s: 1.15.0
> JobMaster should guard against exceptions from OperatorCoordinator
> ------------------------------------------------------------------
>
> Key: FLINK-24386
> URL: https://issues.apache.org/jira/browse/FLINK-24386
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.14.0, 1.13.2
> Reporter: David Morávek
> Priority: Major
> Fix For: 1.15.0
>
>
> Original report from [~sewen]:
> When the scheduler processes the call to trigger a _globalFailover_
> and something goes wrong in there, the _JobManager_ gets stuck. Concretely,
> I have an _OperatorCoordinator_ that throws an exception in
> _subtaskFailed()_, which gets called as part of processing the failover.
> While this is a bug in that coordinator, the whole situation seems a bit
> dangerous to me. If there is a bug in any part of the failover logic, we
> have no safety net: no "hard crash" that lets the process be restarted. We
> only see a log line (below) and everything goes unresponsive.
> {code:java}
> ERROR org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor [] - Caught
> exception while executing runnable in main thread.
> {code}
> Shouldn't we have some safety nets in place here?
> * I am wondering if the place where that line is logged should actually
> invoke the fatal error handler. If an exception propagates out of a
> main-thread action, we need to call off all bets and assume things have
> gotten inconsistent (see the sketch below).
> * At the very least, the failover procedure itself should be guarded. If an
> error happens while processing the global failover, then we need to treat
> this as beyond redemption and declare a fatal error.
> The fatal error would give us a log line and the user a container restart,
> hopefully fixing things (unless it was a deterministic error).
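> To illustrate that first point, here is a minimal, hypothetical sketch of
> guarding a main-thread action so that an escaping exception reaches a fatal
> error handler instead of only being logged. The FatalErrorHandler and guard
> helper below are simplified stand-ins, not Flink's actual classes:
> {code:java}
> // Hypothetical sketch: escalate exceptions that escape a main-thread action to a
> // fatal error handler instead of only logging them. The interfaces below are
> // minimal stand-ins, not Flink's real classes.
> public class MainThreadActionGuardSketch {
>
>     interface FatalErrorHandler {
>         void onFatalError(Throwable cause);
>     }
>
>     /** Wraps a main-thread action so that any escaping Throwable is treated as fatal. */
>     static Runnable guard(Runnable action, FatalErrorHandler fatalErrorHandler) {
>         return () -> {
>             try {
>                 action.run();
>             } catch (Throwable t) {
>                 // Once an exception escapes a main-thread action we can no longer
>                 // trust the component state; escalate to the fatal error handler.
>                 fatalErrorHandler.onFatalError(t);
>             }
>         };
>     }
>
>     public static void main(String[] args) {
>         // In a real JobManager this handler would terminate the process so the
>         // container can be restarted; here it only logs.
>         FatalErrorHandler handler = cause ->
>                 System.err.println("Fatal error, terminating the process: " + cause);
>         // A buggy failover action: the guard turns it into a fatal error instead of
>         // leaving the process silently unresponsive.
>         guard(() -> { throw new IllegalStateException("bug in failover logic"); }, handler).run();
>     }
> }
> {code}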
> [~dmvk] notes:
> * OperatorCoordinator is part of the public API (it is part of the JobGraph).
> ** It can be provided by implementing CoordinatedOperatorFactory.
> ** This actually gives the issue a higher priority than I initially thought.
> * We should guard against flaws in user code:
> ** There are two types of interfaces:
> *** (CRITICAL) Public API for JobGraph construction / submission
> *** Semi-public interfaces such as custom HA Services; these are for power
> users, so I wouldn't be as concerned there.
> ** We already do a good job guarding against failures on the TM side.
> ** Considering the critical parts on the JM side, there are two places where
> user code can "hook" in:
> *** OperatorCoordinator
> *** InitializeOnMaster, FinalizeOnMaster (batch sinks only, legacy from the
> Hadoop world)
> ---
> We should audit all the calls to OperatorCoordinator and handle failures
> accordingly. We want to avoid unnecessary JVM terminations as much as
> possible (though sometimes that is the only option).
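> As a starting point for that audit, here is a minimal, hypothetical sketch of
> guarding a single coordinator callback so that a faulty user implementation
> fails the job deliberately instead of escaping into the scheduler's main
> thread. The OperatorCoordinator and JobFailureHandler types below are
> simplified stand-ins, not the real Flink interfaces:
> {code:java}
> // Hypothetical sketch: wrap calls into a (possibly user-provided) coordinator so
> // that an exception fails the job deliberately rather than escaping into the
> // scheduler's main thread. The types below are simplified stand-ins.
> public class CoordinatorCallGuardSketch {
>
>     interface OperatorCoordinator {
>         void subtaskFailed(int subtask, Throwable reason);
>     }
>
>     interface JobFailureHandler {
>         void failJob(Throwable cause);
>     }
>
>     /** Delivers the subtask failure, treating a coordinator exception as a job failure. */
>     static void notifySubtaskFailed(
>             OperatorCoordinator coordinator,
>             int subtask,
>             Throwable reason,
>             JobFailureHandler failureHandler) {
>         try {
>             coordinator.subtaskFailed(subtask, reason);
>         } catch (Throwable t) {
>             // A misbehaving coordinator must not take down the main thread; fail the
>             // job (or, where that is impossible, escalate to a fatal error instead).
>             failureHandler.failJob(t);
>         }
>     }
>
>     public static void main(String[] args) {
>         OperatorCoordinator buggyCoordinator = (subtask, reason) -> {
>             throw new RuntimeException("bug in user-provided coordinator");
>         };
>         notifySubtaskFailed(buggyCoordinator, 0, new Exception("task failure"),
>                 cause -> System.err.println("Failing the job instead of the JVM: " + cause));
>     }
> }
> {code}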
--
This message was sent by Atlassian Jira
(v8.3.4#803005)