Zhou Parker created FLINK-24692:
-----------------------------------
Summary: kubernetes session mode deployment failed since slot
allocation timeout
Key: FLINK-24692
URL: https://issues.apache.org/jira/browse/FLINK-24692
Project: Flink
Issue Type: Bug
Components: Deployment / Kubernetes
Affects Versions: 1.11.2
Reporter: Zhou Parker
Kubernetes: 1.15
Flink: 1.11.2
When submitting the TopSpeedWindowing demo in session mode on k8s, the job failed.

Log from JM:
Caused by:
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Could not allocate the required slot within slot request timeout. Please make
sure that the cluster has enough resources.
at
org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441)
~[flink-dist_2.11-1.11.2.jar:1.11.2]
... 45 more
Caused by: java.util.concurrent.CompletionException:
java.util.concurrent.TimeoutException
at
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
~[?:1.8.0_275]
at
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
~[?:1.8.0_275]
at
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
~[?:1.8.0_275]
at
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
~[?:1.8.0_275]
... 25 more
Caused by: java.util.concurrent.TimeoutException
... 23 more
Log from TM:
2021-10-29 06:54:22,862 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService
[] - Starting RPC endpoint for
org.apache.flink.runtime.taskexecutor.TaskExecutor at
akka://flink/user/rpc/taskmanager_0 .
2021-10-29 06:54:22,875 INFO
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Start job
leader service.
2021-10-29 06:54:22,877 INFO org.apache.flink.runtime.filecache.FileCache [] -
User file cache uses directory
/tmp/flink-dist-cache-7fb5ad02-77e1-4942-8ab6-3e10347664c4
2021-10-29 06:54:22,935 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
[] - Connecting to ResourceManager
akka.tcp://[email protected]:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
2021-10-29 06:54:22,940 DEBUG org.apache.flink.runtime.rpc.akka.AkkaRpcService
[] - Try to connect to remote RPC endpoint with address
akka.tcp://[email protected]:6123/user/rpc/resourcemanager_*. Returning a
org.apache.flink.runtime.resourcemanager.ResourceManagerGateway gateway.
2021-10-29 06:54:23,265 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
[] - Resolved ResourceManager address, beginning registration
2021-10-29 06:54:23,265 DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Registration at
ResourceManager attempt 1 (timeout=100ms)
2021-10-29 06:54:23,391 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
[] - Successful registration at resource manager
akka.tcp://[email protected]:6123/user/rpc/resourcemanager_* under
registration id dca9eaff5da556d2b99bd447a07538b7.
2021-10-29 06:54:23,456 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
[] - Receive slot request 190c5be552e5aed60834096b6e1efc2f for job
f5680609a3e78061e63e97268e1860c6 from resource manager with leader id
00000000000000000000000000000000.
2021-10-29 06:54:23,462 DEBUG org.apache.flink.runtime.memory.MemoryManager []
- Initialized MemoryManager with total memory size 536870920 and page size
32768.
2021-10-29 06:54:23,464 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
[] - Allocated slot for 190c5be552e5aed60834096b6e1efc2f.
2021-10-29 06:54:23,465 INFO
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job
f5680609a3e78061e63e97268e1860c6 for job leader monitoring.
2021-10-29 06:54:23,466 DEBUG
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - New leader
information for job f5680609a3e78061e63e97268e1860c6. Address:
akka.tcp://[email protected]:6123/user/rpc/jobmanager_2, leader id:
00000000000000000000000000000000.
2021-10-29 06:54:23,467 INFO
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Try to
register at job manager
akka.tcp://[email protected]:6123/user/rpc/jobmanager_2 with leader id
00000000-0000-0000-0000-000000000000.
2021-10-29 06:54:23,468 DEBUG org.apache.flink.runtime.rpc.akka.AkkaRpcService
[] - Try to connect to remote RPC endpoint with address
akka.tcp://[email protected]:6123/user/rpc/jobmanager_2. Returning a
org.apache.flink.runtime.jobmaster.JobMasterGateway gateway.
2021-10-29 06:54:23,541 INFO
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Resolved
JobManager address, beginning registration
2021-10-29 06:54:23,542 DEBUG
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration
at JobManager attempt 1 (timeout=100ms)
2021-10-29 06:54:23,660 DEBUG
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration
at JobManager (akka.tcp://[email protected]:6123/user/rpc/jobmanager_2)
attempt 1 timed out after 100 ms
2021-10-29 06:54:23,660 DEBUG
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration
at JobManager attempt 2 (timeout=200ms)
2021-10-29 06:54:23,878 DEBUG
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration
at JobManager (akka.tcp://[email protected]:6123/user/rpc/jobmanager_2)
attempt 2 timed out after 200 ms
2021-10-29 06:54:23,879 DEBUG
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration
at JobManager attempt 3 (timeout=400ms)
2021-10-29 06:54:24,299 DEBUG
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration
at JobManager (akka.tcp://[email protected]:6123/user/rpc/jobmanager_2)
attempt 3 timed out after 400 ms
2021-10-29 06:54:24,299 DEBUG
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration
at JobManager attempt 4 (timeout=800ms)
2021-10-29 06:54:25,118 DEBUG
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration
at JobManager (akka.tcp://[email protected]:6123/user/rpc/jobmanager_2)
attempt 4 timed out after 800 ms
2021-10-29 06:54:25,119 DEBUG
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration
at JobManager attempt 5 (timeout=1600ms)
2021-10-29 06:54:26,603 DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Received heartbeat
request from 8edb8ed60a1b18ffb9913e3d01670115.
2021-10-29 06:54:26,739 DEBUG
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration
at JobManager (akka.tcp://[email protected]:6123/user/rpc/jobmanager_2)
attempt 5 timed out after 1600 ms
2021-10-29 06:54:26,739 DEBUG
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration
at JobManager attempt 6 (timeout=3200ms)
2021-10-29 06:54:29,958 DEBUG
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration
at JobManager (akka.tcp://[email protected]:6123/user/rpc/jobmanager_2)
attempt 6 timed out after 3200 ms
2021-10-29 06:54:29,959 DEBUG
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration
at JobManager attempt 7 (timeout=6400ms)
2021-10-29 06:54:33,465 DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Free slot with
allocation id 190c5be552e5aed60834096b6e1efc2f because: The slot
190c5be552e5aed60834096b6e1efc2f has timed out.
2021-10-29 06:54:33,466 DEBUG
org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot
TaskSlot(index:0, state:ALLOCATED, resource profile:
ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=384.000mb
(402653174 bytes), taskOffHeapMemory=0 bytes, managedMemory=512.000mb
(536870920 bytes), networkMemory=128.000mb (134217730 bytes)}, allocationId:
190c5be552e5aed60834096b6e1efc2f, jobId: f5680609a3e78061e63e97268e1860c6).
java.lang.Exception: The slot 190c5be552e5aed60834096b6e1efc2f has timed out.
at
org.apache.flink.runtime.taskexecutor.TaskExecutor.timeoutSlot(TaskExecutor.java:1653)
~[flink-dist_2.11-1.11.2.jar:1.11.2]
at
org.apache.flink.runtime.taskexecutor.TaskExecutor.access$2800(TaskExecutor.java:173)
~[flink-dist_2.11-1.11.2.jar:1.11.2]
at
org.apache.flink.runtime.taskexecutor.TaskExecutor$SlotActionsImpl.lambda$timeoutSlot$1(TaskExecutor.java:1940)
~[flink-dist_2.11-1.11.2.jar:1.11.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
~[flink-dist_2.11-1.11.2.jar:1.11.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
~[flink-dist_2.11-1.11.2.jar:1.11.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
~[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[flink-dist_2.11-1.11.2.jar:1.11.2]
2021-10-29 06:54:33,471 INFO
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Remove job
f5680609a3e78061e63e97268e1860c6 from job leader monitoring.
2021-10-29 06:54:33,471 DEBUG
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Retrying
registration towards akka.tcp://[email protected]:6123/user/rpc/jobmanager_2
was cancelled.
2021-10-29 06:54:33,472 DEBUG
org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] -
Releasing local state under allocation id 190c5be552e5aed60834096b6e1efc2f.
2021-10-29 06:54:36,622 DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Received heartbeat
request from 8edb8ed60a1b18ffb9913e3d01670115.
2021-10-29 06:54:46,642 DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Received heartbeat
request from 8edb8ed60a1b18ffb9913e3d01670115.
2021-10-29 06:54:56,662 DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Received heartbeat
request from 8edb8ed60a1b18ffb9913e3d01670115.
2021-10-29 06:55:06,616 DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Close ResourceManager
connection 8edb8ed60a1b18ffb9913e3d01670115.
org.apache.flink.util.FlinkException: TaskExecutor exceeded the idle timeout.
at
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl.releaseTaskExecutor(SlotManagerImpl.java:1258)
~[flink-dist_2.11-1.11.2.jar:1.11.2]
at
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl.lambda$releaseTaskExecutorIfPossible$14(SlotManagerImpl.java:1251)
~[flink-dist_2.11-1.11.2.jar:1.11.2]
at
java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
~[?:1.8.0_275]
at
java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
~[?:1.8.0_275]
at
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
~[?:1.8.0_275]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
~[flink-dist_2.11-1.11.2.jar:1.11.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
~[flink-dist_2.11-1.11.2.jar:1.11.2]
at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
~[flink-dist_2.11-1.11.2.jar:1.11.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
~[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[flink-dist_2.11-1.11.2.jar:1.11.2]
at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[flink-dist_2.11-1.11.2.jar:1.11.2]
2021-10-29 06:55:06,622 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
[] - Connecting to ResourceManager
akka.tcp://[email protected]:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
2021-10-29 06:55:06,623 DEBUG org.apache.flink.runtime.rpc.akka.AkkaRpcService
[] - Try to connect to remote RPC endpoint with address
akka.tcp://[email protected]:6123/user/rpc/resourcemanager_*. Returning a
org.apache.flink.runtime.resourcemanager.ResourceManagerGateway gateway.
2021-10-29 06:55:06,631 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
[] - Resolved ResourceManager address, beginning registration
2021-10-29 06:55:06,631 DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Registration at
ResourceManager attempt 1 (timeout=100ms)
2021-10-29 06:55:06,636 INFO
org.apache.flink.kubernetes.taskmanager.KubernetesTaskExecutorRunner [] -
RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2021-10-29 06:55:06,638 INFO org.apache.flink.runtime.blob.TransientBlobCache
[] - Shutting down BLOB cache
2021-10-29 06:55:06,639 DEBUG
org.apache.flink.runtime.io.disk.iomanager.IOManager [] - Shutting down I/O
manager.
2021-10-29 06:55:06,640 INFO org.apache.flink.runtime.filecache.FileCache [] -
removed file cache directory
/tmp/flink-dist-cache-7fb5ad02-77e1-4942-8ab6-3e10347664c4
2021-10-29 06:55:06,641 INFO org.apache.flink.runtime.blob.PermanentBlobCache
[] - Shutting down BLOB cache
2021-10-29 06:55:06,643 INFO
org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] -
Shutting down TaskExecutorLocalStateStoresManager.
2021-10-29 06:55:06,645 INFO
org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] - FileChannelManager
removed spill file directory /tmp/flink-io-66cad1f9-ce74-4c01-a02b-32d2e11dcb5a
2021-10-29 06:55:06,646 INFO
org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] - FileChannelManager
removed spill file directory
/tmp/flink-netty-shuffle-bbc6e6a4-9973-48a5-83b1-3ef94d8605f3
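
Reading the TM log above, the registration timeout towards the JobManager appears to double on every attempt (100 ms, 200 ms, ... 6400 ms), while the slot is freed about 10 s after it was allocated. The following is a minimal Python sketch of that arithmetic, not Flink code. It uses only timestamps and timeout values copied from the log; the function names are hypothetical, and the doubling rule is an assumption inferred from the seven logged attempts.

```python
from datetime import datetime
from typing import List

# Timestamp layout used by the TM log, e.g. "2021-10-29 06:54:23,464"
FMT = "%Y-%m-%d %H:%M:%S,%f"


def ms_between(start: str, end: str) -> int:
    """Milliseconds elapsed between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return round(delta.total_seconds() * 1000)


def backoff_timeouts(initial_ms: int, attempts: int) -> List[int]:
    """Timeouts that double per attempt (assumption inferred from the log)."""
    return [initial_ms * 2 ** i for i in range(attempts)]


# Values copied from the TM log above.
slot_allocated = "2021-10-29 06:54:23,464"   # "Allocated slot for ..."
slot_freed = "2021-10-29 06:54:33,465"       # "Free slot ... has timed out"
attempt7_start = "2021-10-29 06:54:29,959"   # "attempt 7 (timeout=6400ms)"

timeouts = backoff_timeouts(100, 7)
slot_window_ms = ms_between(slot_allocated, slot_freed)
spent_on_failed_attempts_ms = sum(timeouts[:6])
left_for_attempt7_ms = ms_between(attempt7_start, slot_freed)

print("timeouts per attempt (ms):", timeouts)
print("slot held for (ms):", slot_window_ms)
print("sum of attempts 1-6 (ms):", spent_on_failed_attempts_ms)
print("time left when attempt 7 started (ms):", left_for_attempt7_ms)
print("attempt 7 could finish before slot release:",
      timeouts[6] <= left_for_attempt7_ms)
```

Under these assumptions, six registration attempts time out before the slot is released, and the seventh is cancelled when the slot times out. In other words, the TM never got a registration response from the JobManager at jobmanager_2 within the window, even though registration at the ResourceManager succeeded on the first attempt.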