[ 
https://issues.apache.org/jira/browse/FLINK-24692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440358#comment-17440358
 ] 

Yang Wang commented on FLINK-24692:
-----------------------------------

It seems that the registered TaskManager is released as idle too quickly. Could you 
please increase {{resourcemanager.taskmanager-timeout}} to 120000 (ms) and try again?
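
For reference, a minimal sketch of how that setting might look in {{flink-conf.yaml}}. The value is in milliseconds. The default is 30000, which would match the roughly 30 s gap in the quoted TM log between the slot being freed (06:54:33) and the idle TaskManager being released (06:55:06).

```yaml
# flink-conf.yaml (sketch): keep an idle TaskManager for 120 s before releasing it
resourcemanager.taskmanager-timeout: 120000
```

The same option can also be passed as a dynamic property when starting the session, e.g. {{-Dresourcemanager.taskmanager-timeout=120000}} on the {{kubernetes-session.sh}} command line.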

> kubernetes session mode deployment failed since slot allocation timeout
> -----------------------------------------------------------------------
>
>                 Key: FLINK-24692
>                 URL: https://issues.apache.org/jira/browse/FLINK-24692
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.11.2
>            Reporter: Zhou Parker
>            Priority: Major
>         Attachments: jobmanager_log.txt
>
>
> Kubernetes: 1.15
> Flink: 1.11.2
>  
> When submitting the TopSpeedWindowing demo in session mode on k8s, the job failed.
>  
> Log from JM:
>  
> Caused by: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Could not allocate the required slot within slot request timeout. Please make 
> sure that the cluster has enough resources.
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441)
>  ~[flink-dist_2.11-1.11.2.jar:1.11.2]
>     ... 45 more
> Caused by: java.util.concurrent.CompletionException: 
> java.util.concurrent.TimeoutException
>     at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>  ~[?:1.8.0_275]
>     at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>  ~[?:1.8.0_275]
>     at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) 
> ~[?:1.8.0_275]
>     at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>  ~[?:1.8.0_275]
>     ... 25 more
> Caused by: java.util.concurrent.TimeoutException
>     ... 23 more
>  
> Log from TM:
>  
> 2021-10-29 06:54:22,862 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService 
> [] - Starting RPC endpoint for 
> org.apache.flink.runtime.taskexecutor.TaskExecutor at 
> akka://flink/user/rpc/taskmanager_0 .
> 2021-10-29 06:54:22,875 INFO 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Start job 
> leader service.
> 2021-10-29 06:54:22,877 INFO org.apache.flink.runtime.filecache.FileCache [] 
> - User file cache uses directory 
> /tmp/flink-dist-cache-7fb5ad02-77e1-4942-8ab6-3e10347664c4
> 2021-10-29 06:54:22,935 INFO 
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to 
> ResourceManager 
> akka.tcp://[email protected]:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
> 2021-10-29 06:54:22,940 DEBUG 
> org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Try to connect to 
> remote RPC endpoint with address 
> akka.tcp://[email protected]:6123/user/rpc/resourcemanager_*. Returning a 
> org.apache.flink.runtime.resourcemanager.ResourceManagerGateway gateway.
> 2021-10-29 06:54:23,265 INFO 
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Resolved 
> ResourceManager address, beginning registration
> 2021-10-29 06:54:23,265 DEBUG 
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Registration at 
> ResourceManager attempt 1 (timeout=100ms)
> 2021-10-29 06:54:23,391 INFO 
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful 
> registration at resource manager 
> akka.tcp://[email protected]:6123/user/rpc/resourcemanager_* under 
> registration id dca9eaff5da556d2b99bd447a07538b7.
> 2021-10-29 06:54:23,456 INFO 
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive slot request 
> 190c5be552e5aed60834096b6e1efc2f for job f5680609a3e78061e63e97268e1860c6 
> from resource manager with leader id 00000000000000000000000000000000.
> 2021-10-29 06:54:23,462 DEBUG org.apache.flink.runtime.memory.MemoryManager 
> [] - Initialized MemoryManager with total memory size 536870920 and page size 
> 32768.
> 2021-10-29 06:54:23,464 INFO 
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Allocated slot for 
> 190c5be552e5aed60834096b6e1efc2f.
> 2021-10-29 06:54:23,465 INFO 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job 
> f5680609a3e78061e63e97268e1860c6 for job leader monitoring.
> 2021-10-29 06:54:23,466 DEBUG 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - New leader 
> information for job f5680609a3e78061e63e97268e1860c6. Address: 
> akka.tcp://[email protected]:6123/user/rpc/jobmanager_2, leader id: 
> 00000000000000000000000000000000.
> 2021-10-29 06:54:23,467 INFO 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Try to 
> register at job manager 
> akka.tcp://[email protected]:6123/user/rpc/jobmanager_2 with leader id 
> 00000000-0000-0000-0000-000000000000.
> 2021-10-29 06:54:23,468 DEBUG 
> org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Try to connect to 
> remote RPC endpoint with address 
> akka.tcp://[email protected]:6123/user/rpc/jobmanager_2. Returning a 
> org.apache.flink.runtime.jobmaster.JobMasterGateway gateway.
> 2021-10-29 06:54:23,541 INFO 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Resolved 
> JobManager address, beginning registration
> 2021-10-29 06:54:23,542 DEBUG 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - 
> Registration at JobManager attempt 1 (timeout=100ms)
> 2021-10-29 06:54:23,660 DEBUG 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - 
> Registration at JobManager 
> (akka.tcp://[email protected]:6123/user/rpc/jobmanager_2) attempt 1 timed 
> out after 100 ms
> 2021-10-29 06:54:23,660 DEBUG 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - 
> Registration at JobManager attempt 2 (timeout=200ms)
> 2021-10-29 06:54:23,878 DEBUG 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - 
> Registration at JobManager 
> (akka.tcp://[email protected]:6123/user/rpc/jobmanager_2) attempt 2 timed 
> out after 200 ms
> 2021-10-29 06:54:23,879 DEBUG 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - 
> Registration at JobManager attempt 3 (timeout=400ms)
> 2021-10-29 06:54:24,299 DEBUG 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - 
> Registration at JobManager 
> (akka.tcp://[email protected]:6123/user/rpc/jobmanager_2) attempt 3 timed 
> out after 400 ms
> 2021-10-29 06:54:24,299 DEBUG 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - 
> Registration at JobManager attempt 4 (timeout=800ms)
> 2021-10-29 06:54:25,118 DEBUG 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - 
> Registration at JobManager 
> (akka.tcp://[email protected]:6123/user/rpc/jobmanager_2) attempt 4 timed 
> out after 800 ms
> 2021-10-29 06:54:25,119 DEBUG 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - 
> Registration at JobManager attempt 5 (timeout=1600ms)
> 2021-10-29 06:54:26,603 DEBUG 
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Received heartbeat 
> request from 8edb8ed60a1b18ffb9913e3d01670115.
> 2021-10-29 06:54:26,739 DEBUG 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - 
> Registration at JobManager 
> (akka.tcp://[email protected]:6123/user/rpc/jobmanager_2) attempt 5 timed 
> out after 1600 ms
> 2021-10-29 06:54:26,739 DEBUG 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - 
> Registration at JobManager attempt 6 (timeout=3200ms)
> 2021-10-29 06:54:29,958 DEBUG 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - 
> Registration at JobManager 
> (akka.tcp://[email protected]:6123/user/rpc/jobmanager_2) attempt 6 timed 
> out after 3200 ms
> 2021-10-29 06:54:29,959 DEBUG 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - 
> Registration at JobManager attempt 7 (timeout=6400ms)
> 2021-10-29 06:54:33,465 DEBUG 
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Free slot with 
> allocation id 190c5be552e5aed60834096b6e1efc2f because: The slot 
> 190c5be552e5aed60834096b6e1efc2f has timed out.
> 2021-10-29 06:54:33,466 DEBUG 
> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot 
> TaskSlot(index:0, state:ALLOCATED, resource profile: 
> ResourceProfile\{cpuCores=1.0000000000000000, taskHeapMemory=384.000mb 
> (402653174 bytes), taskOffHeapMemory=0 bytes, managedMemory=512.000mb 
> (536870920 bytes), networkMemory=128.000mb (134217730 bytes)}, allocationId: 
> 190c5be552e5aed60834096b6e1efc2f, jobId: f5680609a3e78061e63e97268e1860c6).
> java.lang.Exception: The slot 190c5be552e5aed60834096b6e1efc2f has timed out.
>  at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor.timeoutSlot(TaskExecutor.java:1653)
>  ~[flink-dist_2.11-1.11.2.jar:1.11.2]
>  at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor.access$2800(TaskExecutor.java:173)
>  ~[flink-dist_2.11-1.11.2.jar:1.11.2]
>  at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor$SlotActionsImpl.lambda$timeoutSlot$1(TaskExecutor.java:1940)
>  ~[flink-dist_2.11-1.11.2.jar:1.11.2]
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
>  ~[flink-dist_2.11-1.11.2.jar:1.11.2]
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
>  ~[flink-dist_2.11-1.11.2.jar:1.11.2]
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>  ~[flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:517) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.actor.ActorCell.invoke(ActorCell.scala:561) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.dispatch.Mailbox.run(Mailbox.scala:225) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:235) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>  [flink-dist_2.11-1.11.2.jar:1.11.2]
> 2021-10-29 06:54:33,471 INFO 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Remove job 
> f5680609a3e78061e63e97268e1860c6 from job leader monitoring.
> 2021-10-29 06:54:33,471 DEBUG 
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Retrying 
> registration towards akka.tcp://[email protected]:6123/user/rpc/jobmanager_2 
> was cancelled.
> 2021-10-29 06:54:33,472 DEBUG 
> org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - 
> Releasing local state under allocation id 190c5be552e5aed60834096b6e1efc2f.
> 2021-10-29 06:54:36,622 DEBUG 
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Received heartbeat 
> request from 8edb8ed60a1b18ffb9913e3d01670115.
> 2021-10-29 06:54:46,642 DEBUG 
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Received heartbeat 
> request from 8edb8ed60a1b18ffb9913e3d01670115.
> 2021-10-29 06:54:56,662 DEBUG 
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Received heartbeat 
> request from 8edb8ed60a1b18ffb9913e3d01670115.
> 2021-10-29 06:55:06,616 DEBUG 
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Close ResourceManager 
> connection 8edb8ed60a1b18ffb9913e3d01670115.
> org.apache.flink.util.FlinkException: TaskExecutor exceeded the idle timeout.
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl.releaseTaskExecutor(SlotManagerImpl.java:1258)
>  ~[flink-dist_2.11-1.11.2.jar:1.11.2]
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl.lambda$releaseTaskExecutorIfPossible$14(SlotManagerImpl.java:1251)
>  ~[flink-dist_2.11-1.11.2.jar:1.11.2]
>  at 
> java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670) 
> ~[?:1.8.0_275]
>  at 
> java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
>  ~[?:1.8.0_275]
>  at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>  ~[?:1.8.0_275]
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
>  ~[flink-dist_2.11-1.11.2.jar:1.11.2]
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
>  ~[flink-dist_2.11-1.11.2.jar:1.11.2]
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  ~[flink-dist_2.11-1.11.2.jar:1.11.2]
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>  ~[flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:517) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.actor.ActorCell.invoke(ActorCell.scala:561) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.dispatch.Mailbox.run(Mailbox.scala:225) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:235) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
> [flink-dist_2.11-1.11.2.jar:1.11.2]
>  at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>  [flink-dist_2.11-1.11.2.jar:1.11.2]
> 2021-10-29 06:55:06,622 INFO 
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to 
> ResourceManager 
> akka.tcp://[email protected]:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
> 2021-10-29 06:55:06,623 DEBUG 
> org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Try to connect to 
> remote RPC endpoint with address 
> akka.tcp://[email protected]:6123/user/rpc/resourcemanager_*. Returning a 
> org.apache.flink.runtime.resourcemanager.ResourceManagerGateway gateway.
> 2021-10-29 06:55:06,631 INFO 
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Resolved 
> ResourceManager address, beginning registration
> 2021-10-29 06:55:06,631 DEBUG 
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Registration at 
> ResourceManager attempt 1 (timeout=100ms)
> 2021-10-29 06:55:06,636 INFO 
> org.apache.flink.kubernetes.taskmanager.KubernetesTaskExecutorRunner [] - 
> RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
> 2021-10-29 06:55:06,638 INFO org.apache.flink.runtime.blob.TransientBlobCache 
> [] - Shutting down BLOB cache
> 2021-10-29 06:55:06,639 DEBUG 
> org.apache.flink.runtime.io.disk.iomanager.IOManager [] - Shutting down I/O 
> manager.
> 2021-10-29 06:55:06,640 INFO org.apache.flink.runtime.filecache.FileCache [] 
> - removed file cache directory 
> /tmp/flink-dist-cache-7fb5ad02-77e1-4942-8ab6-3e10347664c4
> 2021-10-29 06:55:06,641 INFO org.apache.flink.runtime.blob.PermanentBlobCache 
> [] - Shutting down BLOB cache
> 2021-10-29 06:55:06,643 INFO 
> org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - 
> Shutting down TaskExecutorLocalStateStoresManager.
> 2021-10-29 06:55:06,645 INFO 
> org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] - 
> FileChannelManager removed spill file directory 
> /tmp/flink-io-66cad1f9-ce74-4c01-a02b-32d2e11dcb5a
> 2021-10-29 06:55:06,646 INFO 
> org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] - 
> FileChannelManager removed spill file directory 
> /tmp/flink-netty-shuffle-bbc6e6a4-9973-48a5-83b1-3ef94d8605f3
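
To make the retry pattern in the TM log above easier to follow, here is a small Python sketch of the registration timeline. It is only an illustration under assumptions read off the log timestamps, not taken from the Flink source: the first registration timeout is 100 ms, it doubles on every retry, and the slot is freed about 10 s after allocation. The names {{registration_attempts}}, {{INITIAL_TIMEOUT_MS}} and {{SLOT_TIMEOUT_MS}} are made up for this example.

```python
# Idealized timeline of the TaskExecutor -> JobManager registration retries.
# Assumptions (from the log, not from Flink code): 100 ms initial timeout,
# doubling per retry, slot freed 10 s after allocation.

INITIAL_TIMEOUT_MS = 100
SLOT_TIMEOUT_MS = 10_000


def registration_attempts(initial_ms, deadline_ms):
    """Return (attempt, start_ms, timeout_ms) for each attempt started before the deadline."""
    attempts = []
    start, timeout, attempt = 0, initial_ms, 1
    while start < deadline_ms:
        attempts.append((attempt, start, timeout))
        start += timeout  # the next attempt begins when this one times out
        timeout *= 2      # exponential backoff
        attempt += 1
    return attempts


attempts = registration_attempts(INITIAL_TIMEOUT_MS, SLOT_TIMEOUT_MS)
for attempt, start, timeout in attempts:
    print(f"attempt {attempt}: starts at {start} ms, timeout {timeout} ms")

last_attempt, last_start, last_timeout = attempts[-1]
print(f"attempt {last_attempt} would run until {last_start + last_timeout} ms, "
      f"past the {SLOT_TIMEOUT_MS} ms slot timeout")
```

Under these assumptions the seventh attempt (timeout 6400 ms) is still pending when the slot times out, which is consistent with the log: attempt 7 starts at 06:54:29,959 and the slot is freed at 06:54:33,465. The real timestamps drift by a few milliseconds per attempt, which the sketch ignores.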



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
