[jira] [Commented] (FLINK-27245) Flink job on Yarn cannot revover when zookeeper in Exception

Zhanghao Chen (Jira) Thu, 14 Apr 2022 02:00:07 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-27245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522171#comment-17522171
 ]


Zhanghao Chen commented on FLINK-27245:
---------------------------------------

If you maintain an internal fork of Flink 1.7.2, this is an easy one-line fix: 
change Curator's error handling strategy. See 
https://curator.apache.org/errors.html for details.
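
To illustrate what "error handling strategy" refers to, here is a minimal, 
dependency-free sketch of the two connection state error policies described on 
that Curator page. It assumes a Curator version that ships 
ConnectionStateErrorPolicy (3.0 or later, where the policy can be set with 
CuratorFrameworkFactory.builder().connectionStateErrorPolicy(new 
SessionConnectionStateErrorPolicy())); the Curator bundled with your fork may be 
older. The class ErrorPolicySketch and the helper keepsLeadership are 
hypothetical names used only for illustration. They are not Flink or Curator 
APIs, and the leadership model is deliberately simplified.

```java
public class ErrorPolicySketch {

    /** Subset of Curator's ConnectionState values that matter for this sketch. */
    enum ConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    /** Same shape as Curator's ConnectionStateErrorPolicy (modelled here, not imported). */
    interface ErrorPolicy {
        boolean isErrorState(ConnectionState state);
    }

    /** Models the "standard" policy: both SUSPENDED and LOST count as errors. */
    static final ErrorPolicy STANDARD =
            s -> s == ConnectionState.SUSPENDED || s == ConnectionState.LOST;

    /** Models the "session" policy: only LOST (session expiry) counts as an error. */
    static final ErrorPolicy SESSION =
            s -> s == ConnectionState.LOST;

    /**
     * Simplified model (an assumption of this sketch): a leader keeps leadership
     * across a sequence of connection states as long as none of them is an
     * error state under the given policy.
     */
    static boolean keepsLeadership(ErrorPolicy policy, ConnectionState... states) {
        for (ConnectionState s : states) {
            if (policy.isErrorState(s)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A short ZooKeeper hiccup: the connection is suspended, then comes back.
        ConnectionState[] hiccup = {
            ConnectionState.CONNECTED, ConnectionState.SUSPENDED, ConnectionState.RECONNECTED
        };
        System.out.println("standard policy keeps leadership: "
                + keepsLeadership(STANDARD, hiccup));
        System.out.println("session policy keeps leadership:  "
                + keepsLeadership(SESSION, hiccup));
    }
}
```

Under the session-based policy, a temporary SUSPENDED state does not by itself 
revoke leadership; only a LOST session does. Whether that is enough to avoid the 
HA data cleanup in your fork depends on how the fork reacts to losing 
leadership, which the attached logs alone do not confirm.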

> Flink job on Yarn cannot revover when zookeeper in Exception
> ------------------------------------------------------------
>
>                 Key: FLINK-27245
>                 URL: https://issues.apache.org/jira/browse/FLINK-27245
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.7.2
>         Environment: Flink: 1.7.2
> HDFS: 3.1.1
> ZooKeeper: 3.5.1
> HA configuration in flink-conf.yaml:
> flink.security.enable: true
> fs.output.always-create-directory: false
> fs.overwrite-files: false
> high-availability.job.delay: 10 s
> high-availability.storageDir: hdfs:///flink/recovery
> high-availability.zookeeper.client.acl: creator
> high-availability.zookeeper.client.connection-timeout: 15000
> high-availability.zookeeper.client.max-retry-attempts: 3
> high-availability.zookeeper.client.retry-wait: 5000
> high-availability.zookeeper.client.session-timeout: 60000
> high-availability.zookeeper.path.root: /flink
> high-availability.zookeeper.quorum: zk01:24002,zk02:24002,zk03:24002
> high-availability: zookeeper
>            Reporter: hjw
>            Priority: Major
>         Attachments: Job-failed.txt, Job-recover-failed.txt, 
> zookeeper-omm-server-a-dsj-ghficn01.2022-04-07_20-09-25.[1].log
>
>
> The Flink job cannot recover when ZooKeeper is in an exception state.
> I noticed that the data in high-availability.storageDir was deleted when the 
> job failed, which caused the failure when the job was brought up again.
> {code:java}
> (SmarterLeaderLatch.java:570) 
> 2022-04-07 19:54:29,002 | INFO  | [Suspend state waiting handler] | 
> Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 10 
> seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2022-04-07 19:54:29,004 | INFO  | [Suspend state waiting handler] | 
> Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 10 
> seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2022-04-07 19:54:29,004 | INFO  | [Suspend state waiting handler] | 
> Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 10 
> seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2022-04-07 19:54:30,002 | INFO  | [Suspend state waiting handler] | 
> Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 11 
> seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2022-04-07 19:54:30,002 | INFO  | [Suspend state waiting handler] | 
> Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 11 
> seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2022-04-07 19:54:30,004 | INFO  | [Suspend state waiting handler] | 
> Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 11 
> seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2022-04-07 19:54:30,004 | INFO  | [Suspend state waiting handler] | 
> Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 11 
> seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch 
> (SmarterLeaderLatch.java:570) 
> 2022-04-07 19:54:30,769 | INFO  | [BlobServer shutdown hook] | 
> FileSystemBlobStore cleaning up 
> hdfs:/flink/recovery/application_1625720467511_45233. | 
> org.apache.flink.runtime.blob.FileSystemBlobStor
> {code}
> {code:java}
> 2022-04-07 19:55:29,452 | INFO  | [flink-akka.actor.default-dispatcher-4] | 
> Recovered SubmittedJobGraph(1898637f2d11429bd5f5767ea1daaf79, null). | 
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore 
> (ZooKeeperSubmittedJobGraphStore.java:215) 
> 2022-04-07 19:55:29,467 | ERROR | [flink-akka.actor.default-dispatcher-17] | 
> Fatal error occurred in the cluster entrypoint. | 
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint 
> (ClusterEntrypoint.java:408) 
> java.lang.RuntimeException: 
> org.apache.flink.runtime.client.JobExecutionException: Could not set up 
> JobManager
>       at 
> org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
>       at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>       at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>       at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not 
> set up JobManager
>       at 
> org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:1058)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:308)
>       at 
> org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
>       ... 7 common frames omitted
> Caused by: java.lang.Exception: Cannot set up the user code libraries: File 
> does not exist: 
> /flink/recovery/application_1625720467511_45233/blob/job_1898637f2d11429bd5f5767ea1daaf79/blob_p-7128d0ae4a06a277e3b1182c99eb616ffd45b590-c90586d4a5d4641fcc0c9e4cab31c131
>       at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
>       at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:153)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1951)
>       at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:742)
>       at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:439)
>       at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
> {code}
> I collected the logs from when the job failed because of the ZooKeeper error 
> and from the attempt to restart the JobManager through YARN and ZooKeeper.
> The error happened at 2022/04/07 19:54.
> By the way, where can I learn about the implementation and principles of 
> Flink HA?
> Thanks.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
