David Morávek created FLINK-24386:
-------------------------------------

             Summary: JobMaster should guard against exceptions from 
OperatorCoordinator
                 Key: FLINK-24386
                 URL: https://issues.apache.org/jira/browse/FLINK-24386
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.13.2, 1.14.0
            Reporter: David Morávek


Original report from [~sewen]:

When the scheduler processes the call to trigger a _globalFailover_
 and something goes wrong in there, the _JobManager_ gets stuck. Concretely, I 
have an _OperatorCoordinator_ that throws an exception in _subtaskFailed()_, 
which gets called as part of processing the failover.
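
To make the failure mode concrete, here is a minimal sketch with simplified stand-in types (_Coordinator_, _BuggyCoordinator_, and _Scheduler_ are illustrative names, not the actual Flink classes):
{code:java}
// Simplified stand-ins for the real Flink types, for illustration only.
interface Coordinator {
    void subtaskFailed(int subtask, Throwable reason);
}

class BuggyCoordinator implements Coordinator {
    @Override
    public void subtaskFailed(int subtask, Throwable reason) {
        // A bug in user code: the unchecked exception thrown here propagates
        // straight into the scheduler's failover routine.
        throw new IllegalStateException("bug in coordinator");
    }
}

class Scheduler {
    private final Coordinator coordinator = new BuggyCoordinator();

    void triggerGlobalFailover(Throwable cause) {
        // No try/catch around the coordinator callback: the exception escapes
        // the main-thread action, the failover never completes, and the
        // JobManager is left unresponsive.
        coordinator.subtaskFailed(0, cause);
        // ... restart logic after this call is never reached ...
    }
}
{code}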

While this is a bug in that coordinator, the whole thing seems a bit dangerous
to me. If there is some bug in any part of the failover logic, we have no
safety net: no "hard crash" that lets the process be restarted. We only see a
log line (below) and everything goes unresponsive.
{code:java}
ERROR org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor [] - Caught exception while executing runnable in main thread.
{code}
Shouldn't we have some safety nets in place here?
 * I am wondering if the place where that line is logged should actually
invoke the fatal error handler (see the sketch below). If an exception
propagates out of a main-thread action, we need to call off all bets and
assume things have gotten inconsistent.
 * At the very least, the failover procedure itself should be guarded. If an
error happens while processing the global failover, we need to treat it as
beyond redemption and declare a fatal error.

The fatal error would give us a log line and the user a container restart, 
hopefully fixing things (unless it was a deterministic error).
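
A minimal sketch of such a safety net, with simplified stand-ins for the Flink concepts (the names below are illustrative, not the actual classes): wrap every main-thread action and escalate any escaping throwable to the fatal error handler.
{code:java}
// Illustrative sketch only; FatalErrorHandler and MainThreadExecutor are
// simplified stand-ins, not the actual Flink implementations.
interface FatalErrorHandler {
    void onFatalError(Throwable t);
}

class MainThreadExecutor {
    private final FatalErrorHandler fatalErrorHandler;

    MainThreadExecutor(FatalErrorHandler fatalErrorHandler) {
        this.fatalErrorHandler = fatalErrorHandler;
    }

    void runInMainThread(Runnable action) {
        try {
            action.run();
        } catch (Throwable t) {
            // Instead of only logging "Caught exception while executing
            // runnable in main thread", declare a fatal error: assume the
            // scheduler state is inconsistent and let the process restart.
            fatalErrorHandler.onFatalError(t);
        }
    }
}
{code}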

[~dmvk] notes:
 * OperatorCoordinator is part of the public API (it is part of the JobGraph).
 ** It can be provided by implementing CoordinatedOperatorFactory.
 ** This actually gives the issue higher priority than I initially thought.
 * We should guard against flaws in user code:
 ** There are two types of interfaces:
 *** (CRITICAL) Public API for JobGraph construction / submission
 *** Semi-public interfaces such as custom HA services; these are for power
users, so I wouldn't be as concerned there.
 ** We already do a good job of guarding against failures on the TM side.
 ** Considering the critical parts on the JM side, there are two places where
user code can "hook" in:
 *** OperatorCoordinator
 *** InitializeOnMaster, FinalizeOnMaster (batch sinks only, legacy from the
Hadoop world)

----

We should audit all the calls to OperatorCoordinator and handle failures
accordingly. We want to avoid unnecessary JVM terminations as much as
possible (though sometimes that is the only option).
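
One possible shape for such a guard, sketched with the hypothetical stand-in types from the examples above (not the actual Flink implementation): wrap each coordinator call, turn exceptions from user code into a job failover, and reserve process termination for the case where even the failover path fails.
{code:java}
import java.util.function.Consumer;

// Hypothetical wrapper around the Coordinator stand-in from the first sketch;
// not the actual Flink implementation.
class GuardedCoordinator implements Coordinator {
    private final Coordinator userCoordinator;
    private final Consumer<Throwable> failJob;     // triggers a job failover
    private final Consumer<Throwable> failProcess; // fatal error handler, last resort

    GuardedCoordinator(
            Coordinator userCoordinator,
            Consumer<Throwable> failJob,
            Consumer<Throwable> failProcess) {
        this.userCoordinator = userCoordinator;
        this.failJob = failJob;
        this.failProcess = failProcess;
    }

    @Override
    public void subtaskFailed(int subtask, Throwable reason) {
        try {
            userCoordinator.subtaskFailed(subtask, reason);
        } catch (Throwable t) {
            try {
                // Prefer failing the job over terminating the JobManager JVM.
                failJob.accept(t);
            } catch (Throwable escalation) {
                failProcess.accept(escalation);
            }
        }
    }
}
{code}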


