[
https://issues.apache.org/jira/browse/FLINK-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15708682#comment-15708682
]
ASF GitHub Bot commented on FLINK-5193:
---------------------------------------
Github user uce commented on a diff in the pull request:
https://github.com/apache/flink/pull/2910#discussion_r90237601
--- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala ---
@@ -505,37 +507,31 @@ class JobManager(
}
}
} catch {
- case t: Throwable => log.error(s"Failed to recover job $jobId.", t)
+ case t: Throwable => log.warn(s"Failed to recover job $jobId.", t)
}
}(context.dispatcher)
case RecoverAllJobs =>
future {
- try {
- // The ActorRef, which is part of the submitted job graph can only be
- // de-serialized in the scope of an actor system.
- akka.serialization.JavaSerializer.currentSystem.withValue(
- context.system.asInstanceOf[ExtendedActorSystem]) {
+ log.info("Attempting to recover all jobs.")
- log.info(s"Attempting to recover all jobs.")
-
- val jobGraphs = submittedJobGraphs.recoverJobGraphs().asScala
+ try {
+ val jobIdsToRecover = submittedJobGraphs.getJobIds().asScala
- if (!leaderElectionService.hasLeadership()) {
- // we've lost leadership. mission: abort.
- log.warn(s"Lost leadership during recovery. Aborting recovery of ${jobGraphs.size} " +
- s"jobs.")
- } else {
- log.info(s"Re-submitting ${jobGraphs.size} job graphs.")
+ if (jobIdsToRecover.isEmpty) {
+ log.info("There are no jobs to recover.")
+ } else {
+ log.info(s"There are ${jobIdsToRecover.size} jobs to recover. Starting the job " +
--- End diff --
Should we do an `if-else` on the log level here and print the job IDs
at debug level?
```
if (isDebug()) {
// There are ${jobIdsToRecover.size} jobs to recover: [jobID]
} else {
// What you already have
}
```
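As a minimal sketch of the suggestion above, the message could be built as follows. `RecoveryLogMessage` and its `debugEnabled` parameter are hypothetical names used only for illustration; in the real code the flag would presumably come from the logger's own debug check (for an SLF4J-style logger, `isDebugEnabled`), and the job IDs would be `JobID` values rather than strings.

```
// Hypothetical helper, not part of Flink: builds the recovery log message.
object RecoveryLogMessage {
  def apply(jobIds: Seq[String], debugEnabled: Boolean): String = {
    // Always report how many jobs are about to be recovered.
    val summary = s"There are ${jobIds.size} jobs to recover."
    // Only list the individual job IDs when debug logging is on.
    if (debugEnabled) s"$summary Job IDs: ${jobIds.mkString("[", ", ", "]")}"
    else summary
  }
}
```

Keeping the ID list behind the debug check avoids long log lines at info level when many jobs are recovered at once.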
> Recovering all jobs fails completely if a single recovery fails
> ---------------------------------------------------------------
>
> Key: FLINK-5193
> URL: https://issues.apache.org/jira/browse/FLINK-5193
> Project: Flink
> Issue Type: Bug
> Components: JobManager
> Affects Versions: 1.2.0, 1.1.3
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
> Fix For: 1.2.0, 1.1.4
>
>
> In the HA case where the {{JobManager}} tries to recover all submitted job
> graphs, e.g. when regaining leadership, it can happen that none of the
> submitted jobs are recovered if a single recovery fails. Instead of failing
> the complete recovery procedure, the {{JobManager}} should still try to
> recover the remaining (non-failing) jobs and print a proper error message for
> the failed recoveries.
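
The behaviour requested in the issue description can be illustrated with a small, self-contained sketch. It is not the code from the pull request: `RecoverAll` and the `recoverJob` function are hypothetical stand-ins for the per-job recovery logic, and job IDs are plain strings here. Note also that `scala.util.Try` only captures non-fatal exceptions, whereas the diff above catches `Throwable`.

```
import scala.util.{Failure, Success, Try}

object RecoverAll {
  // Attempts to recover every job independently, so that one failing
  // recovery does not prevent the remaining jobs from being recovered.
  // `recoverJob` is a hypothetical stand-in for the real recovery logic.
  def recoverAll[A](jobIds: Seq[String])(recoverJob: String => A)
      : (Seq[(String, A)], Seq[(String, Throwable)]) = {
    // Wrap each attempt in Try so a failure stays local to its job.
    val attempts = jobIds.map(id => id -> Try(recoverJob(id)))
    val recovered = attempts.collect { case (id, Success(job)) => id -> job }
    val failed = attempts.collect { case (id, Failure(t)) => id -> t }
    failed.foreach { case (id, t) =>
      // Stand-in for the logger call in the real code.
      Console.err.println(s"Failed to recover job $id: ${t.getMessage}")
    }
    (recovered, failed)
  }
}
```

With this shape, the caller gets both the recovered jobs and the failures back, so it can re-submit the former and report the latter.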
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)