[
https://issues.apache.org/jira/browse/FLINK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535694#comment-16535694
]
ASF GitHub Bot commented on FLINK-9706:
---------------------------------------
GitHub user tillrohrmann opened a pull request:
https://github.com/apache/flink/pull/6279
[FLINK-9706] Properly wait for termination of JobManagerRunner before
restarting jobs
## What is the purpose of the change
In order to avoid race conditions between resource clean up, we now wait
for the proper
termination of a previously running JobMaster responsible for the same job
(e.g. originating
from a job recovery or a re-submission).
This PR also fixes
[FLINK-9439](https://issues.apache.org/jira/browse/FLINK-9439).
## Brief change log
- Cache per `JobManagerRunner` the termination future
- Before submitting a job wait for the termination of a previously running
`JobManagerRunner` responsible for the same `JobID`
## Verifying this change
- Added `DispatcherResourceCleanupTest#testJobSubmissionUnderSameJobId` and
`DispatcherResourceCleanupTest#testJobRecoveryWithPendingTermination`
- Before `DispatcherTest#testJobRecovery` and
`DispatcherTest#testSubmittedJobGraphListener` failed due to not properly
waiting for the termination
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: (no)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
- The S3 file system connector: (no)
## Documentation
- Does this pull request introduce a new feature? (no)
- If yes, how is the feature documented? (not applicable)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tillrohrmann/flink
fixJobManagerRunnerTermination
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/6279.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #6279
----
commit 0e3a19cfa083030f81458dfd36f9bab32d64577a
Author: Till Rohrmann <trohrmann@...>
Date: 2018-07-06T10:38:25Z
[hotfix] Exclude generated Avro types in flink-confluent-schema-registry
from rat check
commit a5d9ff2c16b47b87efc469196c320bd7ba492a95
Author: Till Rohrmann <trohrmann@...>
Date: 2018-07-07T08:53:38Z
[FLINK-9706] Properly wait for termination of JobManagerRunner before
restarting jobs
In order to avoid race conditions between resource clean up, we now wait
for the proper
termination of a previously running JobMaster responsible for the same job
(e.g. originating
from a job recovery or a re-submission).
----
> DispatcherTest#testSubmittedJobGraphListener fails on Travis
> ------------------------------------------------------------
>
> Key: FLINK-9706
> URL: https://issues.apache.org/jira/browse/FLINK-9706
> Project: Flink
> Issue Type: Improvement
> Components: Distributed Coordination, Tests
> Affects Versions: 1.5.0, 1.6.0
> Reporter: Chesnay Schepler
> Assignee: Till Rohrmann
> Priority: Critical
> Labels: pull-request-available, test-stability
> Fix For: 1.5.2, 1.6.0
>
>
> https://travis-ci.org/apache/flink/jobs/399331775
> {code:java}
> testSubmittedJobGraphListener(org.apache.flink.runtime.dispatcher.DispatcherTest)
> Time elapsed: 0.103 sec <<< FAILURE!
> java.lang.AssertionError:
> Expected: a collection with size <1>
> but: collection size was <0>
> at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
> at org.junit.Assert.assertThat(Assert.java:956)
> at org.junit.Assert.assertThat(Assert.java:923)
> at
> org.apache.flink.runtime.dispatcher.DispatcherTest.testSubmittedJobGraphListener(DispatcherTest.java:294)
> testSubmittedJobGraphListener(org.apache.flink.runtime.dispatcher.DispatcherTest)
> Time elapsed: 0.11 sec <<< ERROR!
> org.apache.flink.runtime.util.TestingFatalErrorHandler$TestingException:
> org.apache.flink.runtime.dispatcher.DispatcherException: Could not start the
> added job b8ab3b7fa8a929bf608a5b65896a2b17
> at
> org.apache.flink.runtime.util.TestingFatalErrorHandler.rethrowError(TestingFatalErrorHandler.java:51)
> at
> org.apache.flink.runtime.dispatcher.DispatcherTest.tearDown(DispatcherTest.java:219)
> Caused by: org.apache.flink.runtime.dispatcher.DispatcherException: Could not
> start the added job b8ab3b7fa8a929bf608a5b65896a2b17
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$onAddedJobGraph$28(Dispatcher.java:845)
> at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at
> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
> at
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
> at
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
> at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
> at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
> at
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.flink.util.FlinkException: Failed to submit job
> b8ab3b7fa8a929bf608a5b65896a2b17.
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.submitJob(Dispatcher.java:254)
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$onAddedJobGraph$27(Dispatcher.java:836)
> at
> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
> at
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
> at
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
> at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
> at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
> at
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not
> set up JobManager
> at
> org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
> at
> org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:901)
> at
> org.apache.flink.runtime.dispatcher.DispatcherTest$ExpectedJobIdJobManagerRunnerFactory.createJobManagerRunner(DispatcherTest.java:603)
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:287)
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:277)
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.persistAndRunJob(Dispatcher.java:262)
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.submitJob(Dispatcher.java:249)
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$onAddedJobGraph$27(Dispatcher.java:836)
> at
> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
> at
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
> at
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
> at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
> at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
> at
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.lang.IllegalStateException: No libraries are registered for
> job b8ab3b7fa8a929bf608a5b65896a2b17
> at
> org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.getClassLoader(BlobLibraryCacheManager.java:175)
> at
> org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:137)
> at
> org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:901)
> at
> org.apache.flink.runtime.dispatcher.DispatcherTest$ExpectedJobIdJobManagerRunnerFactory.createJobManagerRunner(DispatcherTest.java:603)
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:287)
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:277)
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.persistAndRunJob(Dispatcher.java:262)
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.submitJob(Dispatcher.java:249)
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$onAddedJobGraph$27(Dispatcher.java:836)
> at
> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
> at
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
> at
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
> at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
> at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
> at
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107){code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)