[ https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586188#comment-16586188 ]

Thomas Wozniakowski commented on FLINK-10184:
---------------------------------------------

Hi [~elevy],

I don't believe it is the same issue (though it may be related). In that issue, 
the jobs are actually recovered successfully (and only fail afterwards because no 
task slots are available). In our case, the Job Manager itself dies immediately 
with logs like this:

{quote}
2018-08-20 16:29:04,535 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
        at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
        at org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:40)
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$waitForTerminatingJobManager$29(Dispatcher.java:820)
        at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705)
        at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:687)
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
        at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
        at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
        at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
        at org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:936)
        at org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:291)
        at org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:281)
        at org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:38)
        ... 21 more
Caused by: java.lang.Exception: Cannot set up the user code libraries: No such file or directory: s3://ew1-integration-pattern-nsbucket-18jn-flinkbucket-1his9qugdhp03/flink/cluster_one/blob/job_4e9a5a9d70ca99dbd394c35f8dfeda65/blob_p-fa5168561c98e3005a724cb817a1ec1a0b3bd3eb-03a884a908837dc8b5a387fb502afa2f
        at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:134)
        ... 25 more
Caused by: java.io.FileNotFoundException: No such file or directory: s3://ew1-integration-pattern-nsbucket-18jn-flinkbucket-1his9qugdhp03/flink/cluster_one/blob/job_4e9a5a9d70ca99dbd394c35f8dfeda65/blob_p-fa5168561c98e3005a724cb817a1ec1a0b3bd3eb-03a884a908837dc8b5a387fb502afa2f
        at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1642)
        at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:521)
        at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.FileSystem.open(FileSystem.java:786)
        at org.apache.flink.fs.s3hadoop.shaded.org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.open(HadoopFileSystem.java:119)
        at org.apache.flink.fs.s3hadoop.shaded.org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.open(HadoopFileSystem.java:36)
        at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:102)
        at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:84)
        at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:506)
        at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:457)
        at org.apache.flink.runtime.blob.BlobServer.getFile(BlobServer.java:430)
        at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
        at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerJob(BlobLibraryCacheManager.java:91)
        at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:131)
        ... 25 more

{quote}
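
For anyone trying to diagnose this on their own cluster, below is a minimal sketch (not Flink code, just a diagnostic I am assuming would help) that lists the job graph entries still registered in ZooKeeper so they can be compared against the blob directories in the HA store. The connect string, the znode path {{/flink/cluster_one/jobgraphs}} and the bucket placeholder are assumptions based on the default ZooKeeper HA layout and the S3 paths in the trace above; adjust them to match your configuration.

{code:java}
// Diagnostic sketch only: lists job graph znodes so they can be compared with the
// HA blob directories. Connect string, znode path and bucket name are placeholders.
import org.apache.zookeeper.ZooKeeper;

import java.util.List;

public class ListStaleJobGraphs {

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zookeeper-host:2181", 10_000, event -> { });
        try {
            // Assumed layout: <high-availability.zookeeper.path.root>/<cluster-id>/jobgraphs
            List<String> jobIds = zk.getChildren("/flink/cluster_one/jobgraphs", false);
            for (String jobId : jobIds) {
                // Each entry should have a matching blob directory in the HA store;
                // if it does not, it is one of the stale job graphs described above.
                System.out.println("job graph in ZK: " + jobId);
                System.out.println("  expected blobs under: "
                        + "s3://<ha-bucket>/flink/cluster_one/blob/job_" + jobId + "/");
            }
        } finally {
            zk.close();
        }
    }
}
{code}

Any job id that shows up here without a matching {{blob/job_<jobid>/}} directory in the HA bucket is one of the stale entries that kills the Job Manager on failover.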

> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-10184
>                 URL: https://issues.apache.org/jira/browse/FLINK-10184
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.2
>            Reporter: Thomas Wozniakowski
>            Priority: Blocker
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a 
> savepoint), the JobGraphs are NOT removed from the Zookeeper {{jobgraphs}} 
> node.
> This means that, if you start a job, cancel it, restart it, cancel it, and so on, 
> you will end up with many job graphs stored in ZooKeeper, but none of the 
> corresponding blobs in the Flink HA directory.
> When an HA failover occurs, the newly elected leader retrieves all of those 
> old JobGraph objects from ZooKeeper, then goes looking for the corresponding 
> blobs in the HA directory. The blobs are not there, so the JobManager explodes 
> and the process dies.
> At this point the cluster has to be fully stopped, the ZooKeeper {{jobgraphs}} 
> entries cleared out by hand, and all the JobManagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in ZooKeeper, the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job graph is 
> still very much there.
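
The quoted description above notes that the stale {{jobgraphs}} entries had to be cleared out by hand with the cluster fully stopped. For completeness, here is a hedged sketch of what that manual step could look like using the plain ZooKeeper client; the connect string and znode path are placeholders, the job id is the one from the logs above, and this must only be run while every JobManager is shut down.

{code:java}
// Hedged sketch of the manual cleanup step; paths and connection details are assumptions.
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class DeleteStaleJobGraph {

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zookeeper-host:2181", 10_000, event -> { });
        try {
            // Stale entry left behind by a cancelled job (job id taken from the logs above).
            deleteRecursively(zk, "/flink/cluster_one/jobgraphs/4e9a5a9d70ca99dbd394c35f8dfeda65");
        } finally {
            zk.close();
        }
    }

    // Deletes children first, because ZooKeeper refuses to delete non-empty znodes.
    private static void deleteRecursively(ZooKeeper zk, String path)
            throws KeeperException, InterruptedException {
        for (String child : zk.getChildren(path, false)) {
            deleteRecursively(zk, path + "/" + child);
        }
        zk.delete(path, -1); // -1 = any version
    }
}
{code}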



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
