[ 
https://issues.apache.org/jira/browse/FLINK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9010:
---------------------------------
    Fix Version/s:     (was: 1.5.4)

> NoResourceAvailableException with FLIP-6 
> -----------------------------------------
>
>                 Key: FLINK-9010
>                 URL: https://issues.apache.org/jira/browse/FLINK-9010
>             Project: Flink
>          Issue Type: Bug
>          Components: ResourceManager
>    Affects Versions: 1.5.0
>            Reporter: Nico Kruber
>            Assignee: Nico Kruber
>            Priority: Major
>              Labels: flip-6
>             Fix For: 1.6.1, 1.7.0, 1.5.5
>
>
> I was trying to run a bigger program with 400 slots (100 TMs, 2 slots each) 
> with FLIP-6 mode and a checkpointing interval of 1000 and got the following 
> exception:
> {code}
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Received new container: 
> container_1521038088305_0257_01_000101 - Remaining pending container 
> requests: 302
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - TaskExecutor container_1521038088305_0257_01_000101 will be 
> started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
> limit 3072 MB
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,155 INFO  org.apache.flink.yarn.Utils                     
>               - Copying from 
> file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_000001/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml
>  to 
> hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml
> 2018-03-16 03:41:20,165 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Prepared local resource for modified yaml: resource { scheme: 
> "hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: 
> "/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml"
>  } size: 595 timestamp: 1521171680164 type: FILE visibility: APPLICATION
> 2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Creating container launch context for TaskManagers
> 2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Starting TaskManagers with command: $JAVA_HOME/bin/java 
> -Xms5120m -Xmx5120m -XX:MaxDirectMemorySize=3072m  
> -Dlog.file=<LOG_DIR>/taskmanager.log 
> -Dlogback.configurationFile=file:./logback.xml 
> -Dlog4j.configuration=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> 
> <LOG_DIR>/taskmanager.out 2> <LOG_DIR>/taskmanager.err
> 2018-03-16 03:41:20,176 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Opening proxy : ip-172-31-3-221.eu-west-1.compute.internal:8041
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Received new container: 
> container_1521038088305_0257_01_000102 - Remaining pending container 
> requests: 301
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - TaskExecutor container_1521038088305_0257_01_000102 will be 
> started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
> limit 3072 MB
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,181 INFO  org.apache.flink.yarn.Utils                     
>               - Copying from 
> file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_000001/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
>  to 
> hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
> 2018-03-16 03:41:20,190 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Prepared local resource for modified yaml: resource { scheme: 
> "hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: 
> "/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml"
>  } size: 595 timestamp: 1521171680190 type: FILE visibility: APPLICATION
> 2018-03-16 03:41:20,194 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Creating container launch context for TaskManagers
> 2018-03-16 03:41:20,194 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Starting TaskManagers with command: $JAVA_HOME/bin/java 
> -Xms5120m -Xmx5120m -XX:MaxDirectMemorySize=3072m  
> -Dlog.file=<LOG_DIR>/taskmanager.log 
> -Dlogback.configurationFile=file:./logback.xml 
> -Dlog4j.configuration=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> 
> <LOG_DIR>/taskmanager.out 2> <LOG_DIR>/taskmanager.err
> 2018-03-16 03:41:20,203 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Opening proxy : ip-172-31-1-233.eu-west-1.compute.internal:8041
> 2018-03-16 03:41:20,713 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register 
> TaskManager 5fb7473a7738ef09e2c1fe8c5fc46e1e at the SlotManager.
> 2018-03-16 03:41:20,938 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint 
> triggering task Source: Custom Source (1/400) is not being executed at the 
> moment. Aborting checkpoint.
> 2018-03-16 03:41:21,611 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register 
> TaskManager a078410d60d99351c0f54691c0beb5ed at the SlotManager.
> 2018-03-16 03:41:21,938 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint 
> triggering task Source: Custom Source (1/400) is not being executed at the 
> moment. Aborting checkpoint.
> 2018-03-16 03:41:21,972 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register 
> TaskManager 6980e6ba9ce4945c7b2e0ede5130c7dc at the SlotManager.
> 2018-03-16 03:41:22,938 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint 
> triggering task Source: Custom Source (1/400) is not being executed at the 
> moment. Aborting checkpoint.
> 2018-03-16 03:41:23,938 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint 
> triggering task Source: Custom Source (1/400) is not being executed at the 
> moment. Aborting checkpoint.
> 2018-03-16 03:41:24,882 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register 
> TaskManager f7401aa710e890b811de8e415f34a61b at the SlotManager.
> 2018-03-16 03:41:24,883 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Replacing old instance of worker for ResourceID 
> container_1521038088305_0257_01_000041
> 2018-03-16 03:41:24,883 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - 
> Unregister TaskManager f7401aa710e890b811de8e415f34a61b from the SlotManager.
> 2018-03-16 03:41:24,883 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register 
> TaskManager e89e020bc7ebccf07849e326b08b6b73 at the SlotManager.
> 2018-03-16 03:41:24,884 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - The target with resource ID 
> container_1521038088305_0257_01_000041 is already been monitored.
> 2018-03-16 03:41:24,885 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Requesting new TaskExecutor container with resources 
> <memory:8192, vCores:16>. Number pending requests 301.
> 2018-03-16 03:41:24,885 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Requesting new TaskExecutor container with resources 
> <memory:8192, vCores:16>. Number pending requests 302.
> 2018-03-16 03:41:24,885 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Requesting new TaskExecutor container with resources 
> <memory:8192, vCores:16>. Number pending requests 303.
> 2018-03-16 03:41:24,885 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Requesting new TaskExecutor container with resources 
> <memory:8192, vCores:16>. Number pending requests 304.
> 2018-03-16 03:41:24,937 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Requesting new TaskExecutor container with resources 
> <memory:8192, vCores:16>. Number pending requests 305.
> 2018-03-16 03:41:24,938 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Requesting new TaskExecutor container with resources 
> <memory:8192, vCores:16>. Number pending requests 306.
> 2018-03-16 03:41:24,938 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint 
> triggering task Source: Custom Source (1/400) is not being executed at the 
> moment. Aborting checkpoint.
> 2018-03-16 03:41:24,938 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Requesting new TaskExecutor container with resources 
> <memory:8192, vCores:16>. Number pending requests 307.
> 2018-03-16 03:41:24,939 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Requesting new TaskExecutor container with resources 
> <memory:8192, vCores:16>. Number pending requests 308.
> 2018-03-16 03:41:25,255 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register 
> TaskManager a075d154164bab5500f42a0aad7312ad at the SlotManager.
> ...
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Could not allocate all requires slots within timeout of 300000 ms. Slots 
> required: 800, slots allocated: 792
>       at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$2(ExecutionGraph.java:997)
>       at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>       at 
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>       at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>       at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>       at 
> org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:517)
>       at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>       at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>       at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>       at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>       at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:755)
>       at akka.dispatch.OnComplete.internal(Future.scala:258)
>       at akka.dispatch.OnComplete.internal(Future.scala:256)
>       at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
>       at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
>       at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>       at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
>       at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>       at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>       at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
>       at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>       at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>       at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>       at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>       at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>       at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>       at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>       at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>       at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to