David Morávek created FLINK-24386:
-------------------------------------
Summary: JobMaster should guard against exceptions from
OperatorCoordinator
Key: FLINK-24386
URL: https://issues.apache.org/jira/browse/FLINK-24386
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.13.2, 1.14.0
Reporter: David Morávek
Original report from [~sewen]:
When the scheduler processes the call to trigger a _globalFailover_
and something goes wrong in there, the _JobManager_ gets stuck. Concretely, I
have an _OperatorCoordinator_ that throws an exception in _subtaskFailed()_,
which gets called as part of processing the failover.
While this is a bug in that coordinator, the whole situation seems dangerous
to me. If there is a bug in any part of the failover logic, we have no safety
net: no "hard crash" that would let the process be restarted. We only see a log
line (below) and everything becomes unresponsive.
{code:java}
ERROR org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor [] - Caught exception while executing runnable in main thread.
{code}
Shouldn't we have some safety nets in place here?
* I am wondering if the place where that line is logged should actually invoke
the fatal error handler. If an exception propagates out of a main thread
action, we need to call off all bets and assume things have gotten inconsistent.
* At the very least, the failover procedure itself should be guarded. If an
error happens while processing the global failover, then we need to treat this
as beyond redemption and declare a fatal error.
The fatal error would give us a log line and the user a container restart,
hopefully fixing things (unless it was a deterministic error).
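As a rough illustration of the first suggestion, the guard could look like the
following. This is a minimal, self-contained sketch, not Flink's actual code:
{{GuardedExecutor}} is a hypothetical name, and the {{FatalErrorHandler}}
interface here is a stand-in modeled loosely after Flink's fatal error handler,
not the real class.

```java
/**
 * Sketch: run a main-thread action and escalate any exception that escapes it
 * to a fatal error handler, instead of only logging it.
 * All names here are hypothetical stand-ins, not Flink classes.
 */
public class GuardedExecutor {

    /** Hypothetical stand-in for a fatal error handler. */
    public interface FatalErrorHandler {
        void onFatalError(Throwable t);
    }

    private final FatalErrorHandler fatalErrorHandler;

    public GuardedExecutor(FatalErrorHandler fatalErrorHandler) {
        this.fatalErrorHandler = fatalErrorHandler;
    }

    /** Runs the action; if it throws, call off all bets and escalate. */
    public void runGuarded(Runnable action) {
        try {
            action.run();
        } catch (Throwable t) {
            // State may now be inconsistent: declare a fatal error so the
            // process gets restarted rather than staying up unresponsive.
            fatalErrorHandler.onFatalError(t);
        }
    }
}
```

The point is that an exception escaping a main-thread action is escalated
rather than merely logged, so the operator gets a container restart instead of
a stuck JobManager.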
[~dmvk] notes:
* OperatorCoordinator is part of a public API interface (part of JobGraph).
** Can be provided by implementing CoordinatedOperatorFactory
** This actually gives the issue higher priority than I initially thought.
* We should guard against flaws in user code:
** There are two types of interfaces
*** (CRITICAL) Public API for JobGraph construction / submission
*** Semi-public interfaces such as custom HA services; these are for power
users, so I'm less concerned there.
** We already do a good job of guarding against failures on the TM side
** Considering the critical parts on the JM side, there are two places where
user code can hook in:
*** OperatorCoordinator
*** InitializeOnMaster, FinalizeOnMaster (batch sinks only, legacy from the
Hadoop world)
---
We should audit all the calls to OperatorCoordinator and handle failures
accordingly. We want to avoid unnecessary JVM terminations as much as possible
(though sometimes that is the only option).
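A guarded coordinator call could look roughly like this. All names below
({{CoordinatorGuard}}, {{Coordinator}}, {{ErrorHandler}}) are hypothetical,
not Flink's real interfaces; the sketch only shows the shape of the guard: a
per-call try/catch that fails the job, so a misbehaving user coordinator does
not poison the scheduler's main thread, and JVM termination stays reserved for
cases where failing the job is itself impossible.

```java
/**
 * Sketch (hypothetical names): guard a single coordinator callback so that
 * failures in user code fail the job rather than the whole process.
 */
public class CoordinatorGuard {

    /** Stand-in for the user-provided coordinator callback. */
    public interface Coordinator {
        void subtaskFailed(int subtask);
    }

    /** Stand-in for the job-level error handler. */
    public interface ErrorHandler {
        void failJob(Throwable t); // preferred over terminating the JVM
    }

    private final Coordinator delegate;
    private final ErrorHandler errorHandler;

    public CoordinatorGuard(Coordinator delegate, ErrorHandler errorHandler) {
        this.delegate = delegate;
        this.errorHandler = errorHandler;
    }

    /** Forwards the callback; a throwing coordinator fails the job. */
    public void subtaskFailed(int subtask) {
        try {
            delegate.subtaskFailed(subtask);
        } catch (Throwable t) {
            // User code misbehaved: fail the job and keep the process alive.
            errorHandler.failJob(t);
        }
    }
}
```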
--
This message was sent by Atlassian Jira
(v8.3.4#803005)