Re: failed to submit Job to Flink standalone ZooKeeper-HA-cluster

Matthias Pohl Fri, 12 Jul 2024 01:59:28 -0700

Hi love_h1...@126.com,
Thanks for reaching out to the Flink community. Just a few general remarks:
- Flink's Jira [1] should be used to report potential bugs. The dev mailing
list is used for design and community-related discussions.
- It's also useful to provide not only snippets of the logs but a broader
extract. That way you avoid to skip log messages that might be important as
well.
- It appears that you're using a quite old Flink version (1.11.6). The
community agreed to only offer support for the two most-recently published
minor versions (i.e. 1.19 and 1.18). Investigating issues of older versions
is quite tedious. We encourage everyone to use recent versions of Apache
Flink.


Best,
Matthias

[1] https://issues.apache.org/jira/projects/FLINK/summary

On Thu, Jul 11, 2024 at 4:13 PM love_h1...@126.com <love_h1...@126.com>
wrote:

>
> I have a standalone ZooKeeper-HA-cluster.
> First restart ZK cluster.
> Second Submit 31 Jobs with Command-Line , `./bin/flink run`
> Then I caught an exception. Specifically, there were 4 jobs that failed to
> submit.
>
> ```
> 18:55:20,243 - [Flink-DispatcherRestEndpoint-thread-2] - [WARN ] -
> [org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.createRpcInvocationMessage(line:290)]
> - Could not create remote rpc invocation message. Failing rpc invocation
> because...
> java.io.IOException: The rpc invocation size 12532388 exceeds the maximum
> akka framesize.
> at
> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.createRpcInvocationMessage(AkkaInvocationHandler.java:283)
> ~[flink-dist_2.12-1.11.6.jar:1.11.6]
> at
> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.invokeRpc(AkkaInvocationHandler.java:210)
> ~[flink-dist_2.12-1.11.6.jar:1.11.6]
> at
> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.invoke(AkkaInvocationHandler.java:133)
> ~[flink-dist_2.12-1.11.6.jar:1.11.6]
> at
> org.apache.flink.runtime.rpc.akka.FencedAkkaInvocationHandler.invoke(FencedAkkaInvocationHandler.java:88)
> ~[flink-dist_2.12-1.11.6.jar:1.11.6]
> at com.sun.proxy.$Proxy36.submitJob(Unknown Source) ~[?:?]
> at
> org.apache.flink.runtime.rest.handler.job.JobSubmitHandler.lambda$handleRequest$0(JobSubmitHandler.java:123)
> ~[flink-dist_2.12-1.11.6.jar:1.11.6]
> at
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
> [?:?]
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
> [?:?]
> at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705)
> [?:?]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> [?:?]
> at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
> [?:?]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> [?:?]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> [?:?]
> at java.lang.Thread.run(Thread.java:834) [?:?]
> 18:55:20,244 - [Flink-DispatcherRestEndpoint-thread-2] - [ERROR] -
> [org.apache.flink.runtime.rest.handler.AbstractHandler.handleException(line:257)]
> - Unhandled exception.
> java.lang.reflect.UndeclaredThrowableException: null
> at com.sun.proxy.$Proxy36.submitJob(Unknown Source) ~[?:?]
> at
> org.apache.flink.runtime.rest.handler.job.JobSubmitHandler.lambda$handleRequest$0(JobSubmitHandler.java:123)
> ~[flink-dist_2.12-1.11.6.jar:1.11.6]
> at
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
> ~[?:?]
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
> [?:?]
> at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705)
> [?:?]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> [?:?]
> at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
> [?:?]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> [?:?]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> [?:?]
> at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: java.io.IOException: The rpc invocation size 12532388 exceeds
> the maximum akka framesize.
> at
> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.createRpcInvocationMessage(AkkaInvocationHandler.java:283)
> ~[flink-dist_2.12-1.11.6.jar:1.11.6]
> at
> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.invokeRpc(AkkaInvocationHandler.java:210)
> ~[flink-dist_2.12-1.11.6.jar:1.11.6]
> at
> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.invoke(AkkaInvocationHandler.java:133)
> ~[flink-dist_2.12-1.11.6.jar:1.11.6]
> at
> org.apache.flink.runtime.rpc.akka.FencedAkkaInvocationHandler.invoke(FencedAkkaInvocationHandler.java:88)
> ~[flink-dist_2.12-1.11.6.jar:1.11.6]
> ... 11 more
> ```
>
> The logs showed some similarities between node A and node B. I believe
> that Some jobs have been submitted to Node A, whereas some other jobs have
> been submitted to Node B.
> ```
>    18:54:51,441 - [main-EventThread] - [INFO ] -
> [org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.startJobMaster(line:321)]
> - JobManager runner for job PrepareMblQuotEvent
> (dfdcb0d6f20c83b0fe99473f3103278a) was granted leadership with session id
> c11603ca-1313-4066-9a21-51e02f0df057 at akka.tcp://flink@Node_A
> :39715/user/rpc/jobmanager_2.
> ```
> ```
>    18:54:51,632 - [main-EventThread] - [INFO ] -
> [org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.startJobMaster(line:321)]
> - JobManager runner for job PrepareErrOrderEvent
> (5918fedb51dd000c017db9cb381fbb1c) was granted leadership with session id
> 85aef144-05c2-473d-b479-0c8c74d089f9 at akka.tcp://flink@Node_B
> :46746/user/rpc/jobmanager_5.
> ```
>
> The logs of Node A indicate that this node is the leader node.
> ```
> 17:19:45,433 - [flink-akka.actor.default-dispatcher-22] - [INFO ] -
> [org.apache.flink.runtime.resourcemanager.ResourceManager.tryAcceptLeadership(line:1118)]
> - ResourceManager akka.tcp://flink@Node_A:46746/user/rpc/resourcemanager_0
> was granted leadership with fencing token ad84d46e902e0cf6da92179447af4e00
>
> 17:19:45,434 - [main-EventThread] - [INFO ] -
> [org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.grantLeadership(line:931)]
> - http://Node_A:xxx was granted leadership with
> leaderSessionID=f60df688-372d-416b-a965-989a59b37feb
>
> 17:19:45,437 - [flink-akka.actor.default-dispatcher-22] - [INFO ] -
> [org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl.start(line:287)]
> - Starting the SlotManager.
>
> 17:19:45,480 - [main-EventThread] - [INFO ] -
> [org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.startInternal(line:97)]
> - Start SessionDispatcherLeaderProcess.
>
> 17:19:45,489 - [cluster-io-thread-1] - [INFO ] -
> [org.apache.flink.runtime.rpc.akka.AkkaRpcService.startServer(line:232)] -
> Starting RPC endpoint for
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher at
> akka://flink/user/rpc/dispatcher_1 .
>
> 17:19:45,495 - [flink-akka.actor.default-dispatcher-23] - [INFO ] -
> [org.apache.flink.runtime.resourcemanager.ResourceManager.registerTaskExecutorInternal(line:891)]
> - Registering TaskManager with ResourceID XXXXXX 
> (akka.tcp://flink@X.X.X:XX/user/rpc/taskmanager_0)
> at ResourceManager
> ```
>
> The logs of Nodes A, B, C, and D all contain the following information
> `Start SessionDispatcherLeaderProcess`, but only Nodes C and D have the
> additional information `Stopping SessionDispatcherLeaderProcess`.
> ```
> 17:19:44,577 - [main-EventThread] - [INFO ] -
> [org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.startInternal(line:97)]
> - Start SessionDispatcherLeaderProcess.
>
> 17:19:44,580 - [Curator-ConnectionStateManager-0] - [INFO ] -
> [org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.closeInternal(line:134)]
> - Stopping SessionDispatcherLeaderProcess.
>
> 17:19:44,590 - [cluster-io-thread-3] - [INFO ] -
> [org.apache.flink.runtime.rpc.akka.AkkaRpcService.startServer(line:232)] -
> Starting RPC endpoint for
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher at
> akka://flink/user/rpc/dispatcher_1 .
> ```
>
> I suspect that Node B mistakenly believes itself to be the leader. Nodes A
> and B are simultaneously modifying the content of a certain
> node(`rest_server_lock`) in Zookeeper.
>
> Whether this is a bug of Flink.
>
>
>
>
>
>
>

Re: failed to submit Job to Flink standalone ZooKeeper-HA-cluster

Reply via email to