[
https://issues.apache.org/jira/browse/FLINK-27245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522188#comment-17522188
]
Matthias Pohl commented on FLINK-27245:
---------------------------------------
Looking at the stack trace, I'm not even sure this is unexpected behavior. The
job initially fails fatally because of an authentication failure. Hence, the
whole Flink cluster, not only the job, ends up in an inconsistent state, which
one would expect to be hard to recover from...
> Flink job on YARN cannot recover when ZooKeeper hits an exception
> -----------------------------------------------------------------
>
> Key: FLINK-27245
> URL: https://issues.apache.org/jira/browse/FLINK-27245
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.7.2
> Environment: Flink: 1.7.2
> HDFS: 3.1.1
> ZooKeeper: 3.5.1
> HA defined in flink-conf.yaml (see the client sketch after this header):
> flink.security.enable: true
> fs.output.always-create-directory: false
> fs.overwrite-files: false
> high-availability.job.delay: 10 s
> high-availability.storageDir: hdfs:///flink/recovery
> high-availability.zookeeper.client.acl: creator
> high-availability.zookeeper.client.connection-timeout: 15000
> high-availability.zookeeper.client.max-retry-attempts: 3
> high-availability.zookeeper.client.retry-wait: 5000
> high-availability.zookeeper.client.session-timeout: 60000
> high-availability.zookeeper.path.root: /flink
> high-availability.zookeeper.quorum: zk01:24002,zk02:24002,zk03:24002
> high-availability: zookeeper
> Reporter: hjw
> Priority: Major
> Attachments: Job-failed.txt, Job-recover-failed.txt,
> zookeeper-omm-server-a-dsj-ghficn01.2022-04-07_20-09-25.[1].log
>
>
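> For reference, the ZooKeeper client settings listed above map onto a Curator client roughly as in the sketch below. This is illustrative only, based on Curator's documented builder API; Flink's actual wiring lives in its ZooKeeper utilities and differs between versions.
> {code:java}
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.CuratorFrameworkFactory;
> import org.apache.curator.retry.ExponentialBackoffRetry;
>
> public class HaClientSketch {
>     public static void main(String[] args) {
>         // Values copied from the flink-conf.yaml above.
>         CuratorFramework client = CuratorFrameworkFactory.builder()
>                 .connectString("zk01:24002,zk02:24002,zk03:24002")
>                 .sessionTimeoutMs(60000)    // ...client.session-timeout
>                 .connectionTimeoutMs(15000) // ...client.connection-timeout
>                 // retry-wait as base sleep, max-retry-attempts as retry cap
>                 .retryPolicy(new ExponentialBackoffRetry(5000, 3))
>                 .namespace("flink")         // ...zookeeper.path.root
>                 .build();
>         client.start();
>     }
> }
> {code}
>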
> The Flink job cannot recover when ZooKeeper runs into an exception.
> I noticed that the data under high-availability.storageDir is deleted when the
> job fails, which then makes the recovery attempt fail when the job is brought
> up again.
> {code:java}
> (SmarterLeaderLatch.java:570)
> 2022-04-07 19:54:29,002 | INFO | [Suspend state waiting handler] |
> Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 10
> seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch
> (SmarterLeaderLatch.java:570)
> 2022-04-07 19:54:29,004 | INFO | [Suspend state waiting handler] |
> Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 10
> seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch
> (SmarterLeaderLatch.java:570)
> 2022-04-07 19:54:29,004 | INFO | [Suspend state waiting handler] |
> Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 10
> seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch
> (SmarterLeaderLatch.java:570)
> 2022-04-07 19:54:30,002 | INFO | [Suspend state waiting handler] |
> Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 11
> seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch
> (SmarterLeaderLatch.java:570)
> 2022-04-07 19:54:30,002 | INFO | [Suspend state waiting handler] |
> Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 11
> seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch
> (SmarterLeaderLatch.java:570)
> 2022-04-07 19:54:30,004 | INFO | [Suspend state waiting handler] |
> Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 11
> seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch
> (SmarterLeaderLatch.java:570)
> 2022-04-07 19:54:30,004 | INFO | [Suspend state waiting handler] |
> Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 11
> seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch
> (SmarterLeaderLatch.java:570)
> 2022-04-07 19:54:30,769 | INFO | [BlobServer shutdown hook] |
> FileSystemBlobStore cleaning up
> hdfs:/flink/recovery/application_1625720467511_45233. |
> org.apache.flink.runtime.blob.FileSystemBlobStor
> {code}
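> The "BlobServer shutdown hook" line above is the critical one: on the fatal failure the blob store wipes the HA storage directory while ZooKeeper keeps its pointer to the job graph. A hypothetical, heavily simplified sketch of such a cleanup, using the standard Hadoop FileSystem API (class and method names are illustrative, not Flink's actual code):
> {code:java}
> import java.net.URI;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class BlobCleanupSketch {
>     public static void main(String[] args) {
>         // Storage path taken from the log line above.
>         Path storageDir =
>                 new Path("hdfs:///flink/recovery/application_1625720467511_45233");
>
>         // A shutdown hook like the "BlobServer shutdown hook" in the log:
>         // it deletes the HA blob directory recursively. Anything ZooKeeper
>         // still references underneath it is lost.
>         Runtime.getRuntime().addShutdownHook(new Thread(() -> {
>             try {
>                 FileSystem fs =
>                         FileSystem.get(URI.create("hdfs:///"), new Configuration());
>                 fs.delete(storageDir, true); // true = recursive
>             } catch (Exception e) {
>                 e.printStackTrace();
>             }
>         }));
>     }
> }
> {code}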
> {code:java}
> 2022-04-07 19:55:29,452 | INFO | [flink-akka.actor.default-dispatcher-4] |
> Recovered SubmittedJobGraph(1898637f2d11429bd5f5767ea1daaf79, null). |
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore
> (ZooKeeperSubmittedJobGraphStore.java:215)
> 2022-04-07 19:55:29,467 | ERROR | [flink-akka.actor.default-dispatcher-17] |
> Fatal error occurred in the cluster entrypoint. |
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint
> (ClusterEntrypoint.java:408)
> java.lang.RuntimeException:
> org.apache.flink.runtime.client.JobExecutionException: Could not set up
> JobManager
> at
> org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
> at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not
> set up JobManager
> at
> org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
> at
> org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:1058)
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:308)
> at
> org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
> ... 7 common frames omitted
> Caused by: java.lang.Exception: Cannot set up the user code libraries: File
> does not exist:
> /flink/recovery/application_1625720467511_45233/blob/job_1898637f2d11429bd5f5767ea1daaf79/blob_p-7128d0ae4a06a277e3b1182c99eb616ffd45b590-c90586d4a5d4641fcc0c9e4cab31c131
> at
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
> at
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:153)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1951)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:742)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:439)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
> {code}
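> One way to confirm the inconsistency: ZooKeeper recovered the job graph pointer (first log line of this block), but the blob it points at is gone from HDFS, so JobManager setup fails with "File does not exist". A minimal check using the standard Hadoop FileSystem API, with the path copied from the stack trace above:
> {code:java}
> import java.net.URI;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class BlobCheckSketch {
>     public static void main(String[] args) throws Exception {
>         // Blob path copied from the "File does not exist" cause above.
>         Path blob = new Path("/flink/recovery/application_1625720467511_45233"
>                 + "/blob/job_1898637f2d11429bd5f5767ea1daaf79"
>                 + "/blob_p-7128d0ae4a06a277e3b1182c99eb616ffd45b590-c90586d4a5d4641fcc0c9e4cab31c131");
>
>         FileSystem fs = FileSystem.get(URI.create("hdfs:///"), new Configuration());
>         // "false" here reproduces the recovery failure: the job graph is in
>         // ZooKeeper but its user-code blob no longer exists.
>         System.out.println("blob exists: " + fs.exists(blob));
>     }
> }
> {code}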
> I collected the logs from when the job failed during the ZooKeeper error and
> from the attempt to restart the JobManager via YARN and ZooKeeper.
> The error happened at 2022-04-07 19:54.
> BTW, where can I learn about the implementation and principles of Flink HA?
> Thanks.