[
https://issues.apache.org/jira/browse/FLINK-10617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953145#comment-16953145
]
Sharon Xie commented on FLINK-10617:
------------------------------------
I'm seeing a similar issue with Flink 1.5.5.
We had 90 TMs with 6 task slots each, deployed in a K8s cluster. Under
normal conditions, ~480 slots are in use running about 90 jobs.
The symptoms were:
* The JM restarted and all the jobs were being recovered.
* The JM reported no slots available, and we found that no slots were released
when the jobs were restarted.
* Looking at the logs, a few jobs reported errors like
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Could not allocate all requires slots within timeout of 300000 ms. Slots
required: 100, slots allocated: 73 (each such job kept reporting the same
number of allocated slots and never moved beyond it). The system looked stuck
in a livelock.
* We increased the TM replica count to 115, bringing the total to 690 task
slots. After this, a few jobs recovered because of the added capacity. However,
we still didn't see any of the used slots being released, and a few jobs
remained stuck in the livelock.
* We edited the TM deployment, which essentially created a new ReplicaSet for
the TMs, and the issue was resolved. The number of used slots returned to ~480
(see the sketches after this list).
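For context, the "300000 ms" in the exception above matches the JobManager's slot request timeout. Below is a minimal sketch of the flink-conf.yaml knobs involved, assuming the option names documented for Flink 1.5/1.6; raising the request timeout only delays the failure and does not release the stuck slots:
{noformat}
# Sketch only; option names as documented for Flink 1.5/1.6.

# How long the SlotPool waits for a requested slot before failing the
# request (default 300000 ms, i.e. the 5 minutes in the exception above).
slot.request.timeout: 300000

# How long an unused slot is kept in the SlotPool before being released
# back to the ResourceManager (default 50000 ms).
slot.idle.timeout: 50000
{noformat}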
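And a sketch of the Kubernetes workaround, assuming a hypothetical Deployment named flink-taskmanager; any change to the pod template creates a fresh ReplicaSet, which is what finally cleared the stuck slots for us:
{noformat}
# Scale out the TMs (what we tried first; this alone did not release the stuck slots).
kubectl scale deployment flink-taskmanager --replicas=115

# Force a new ReplicaSet by touching the pod template (kubectl 1.15+),
# equivalent to manually editing the TM deployment.
kubectl rollout restart deployment flink-taskmanager
{noformat}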
> Restoring job fails because of slot allocation timeout
> ------------------------------------------------------
>
> Key: FLINK-10617
> URL: https://issues.apache.org/jira/browse/FLINK-10617
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.6.1, 1.6.2
> Reporter: Elias Levy
> Priority: Critical
>
> The following may be related to FLINK-9932, but I am unsure. If you believe
> it is, go ahead and close this issue as a duplicate.
> While testing local state recovery on a job with large state, the job failed
> to restore because slot allocation timed out.
> The job runs on a standalone cluster with 12 nodes and 96 task slots (8
> per node). The job has a parallelism of 96, so it consumes all of the slots,
> and has ~200 GB of state in RocksDB.
> To test local state recovery I decided to kill one of the TMs. The TM
> immediately restarted and re-registered with the JM. I confirmed the JM
> showed 96 registered task slots.
> {noformat}
> 21:35:44,616 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Resolved ResourceManager address, beginning registration
> 21:35:44,616 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Registration at ResourceManager attempt 1 (timeout=100ms)
> 21:35:44,640 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Successful registration at resource manager
> akka.tcp://[email protected]:6123/user/resourcemanager under registration
> id 302988dea6afbd613bb2f96429b65d18.
> 21:36:49,667 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Receive slot request AllocationID{4274d96a59d370305520876f5b84fb9f} for
> job 87c61e8ee64cdbd50f191d39610eb58f from resource manager with leader id
> 8e06aa64d5f8961809da38fe7f224cc1.
> 21:36:49,667 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Allocated slot for AllocationID{4274d96a59d370305520876f5b84fb9f}.
> 21:36:49,667 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Add job 87c61e8ee64cdbd50f191d39610eb58f for job leader monitoring.
> 21:36:49,668 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Starting ZooKeeperLeaderRetrievalService
> /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,671 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Try to register at job manager
> akka.tcp://[email protected]:6123/user/jobmanager_3 with leader id
> f85f6f9b-7713-4be3-a8f0-8443d91e5e6d.
> 21:36:49,681 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Receive slot request AllocationID{3a64e2c8c5b22adbcfd3ffcd2b49e7f9} for
> job 87c61e8ee64cdbd50f191d39610eb58f from resource manager with leader id
> 8e06aa64d5f8961809da38fe7f224cc1.
> 21:36:49,681 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Allocated slot for AllocationID{3a64e2c8c5b22adbcfd3ffcd2b49e7f9}.
> 21:36:49,681 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Add job 87c61e8ee64cdbd50f191d39610eb58f for job leader monitoring.
> 21:36:49,681 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Stopping ZooKeeperLeaderRetrievalService
> /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,681 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Starting ZooKeeperLeaderRetrievalService
> /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,683 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Try to register at job manager
> akka.tcp://[email protected]:6123/user/jobmanager_3 with leader id
> f85f6f9b-7713-4be3-a8f0-8443d91e5e6d.
> 21:36:49,687 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Resolved JobManager address, beginning registration
> 21:36:49,687 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Resolved JobManager address, beginning registration
> 21:36:49,687 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Receive slot request AllocationID{740caf20a5f7f767864122dc9a7444d9} for
> job 87c61e8ee64cdbd50f191d39610eb58f from resource manager with leader id
> 8e06aa64d5f8961809da38fe7f224cc1.
> 21:36:49,688 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Registration at JobManager attempt 1 (timeout=100ms)
> 21:36:49,688 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Allocated slot for AllocationID{740caf20a5f7f767864122dc9a7444d9}.
> 21:36:49,688 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Add job 87c61e8ee64cdbd50f191d39610eb58f for job leader monitoring.
> 21:36:49,688 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Stopping ZooKeeperLeaderRetrievalService
> /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,688 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Starting ZooKeeperLeaderRetrievalService
> /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,689 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Receive slot request AllocationID{2ca95a9b9ccbd23d235b338d9aff7e56} for
> job 87c61e8ee64cdbd50f191d39610eb58f from resource manager with leader id
> 8e06aa64d5f8961809da38fe7f224cc1.
> 21:36:49,689 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Allocated slot for AllocationID{2ca95a9b9ccbd23d235b338d9aff7e56}.
> 21:36:49,689 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Add job 87c61e8ee64cdbd50f191d39610eb58f for job leader monitoring.
> 21:36:49,689 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Stopping ZooKeeperLeaderRetrievalService
> /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,689 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Starting ZooKeeperLeaderRetrievalService
> /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,694 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Receive slot request AllocationID{0521fab5d106362671db3b18031685a7} for
> job 87c61e8ee64cdbd50f191d39610eb58f from resource manager with leader id
> 8e06aa64d5f8961809da38fe7f224cc1.
> 21:36:49,694 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Allocated slot for AllocationID{0521fab5d106362671db3b18031685a7}.
> 21:36:49,694 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Try to register at job manager
> akka.tcp://[email protected]:6123/user/jobmanager_3 with leader id
> f85f6f9b-7713-4be3-a8f0-8443d91e5e6d.
> 21:36:49,694 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Add job 87c61e8ee64cdbd50f191d39610eb58f for job leader monitoring.
> 21:36:49,694 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Stopping ZooKeeperLeaderRetrievalService
> /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,695 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Starting ZooKeeperLeaderRetrievalService
> /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,696 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Receive slot request AllocationID{f88e958c2c13a27f6ebaca68892c6554} for
> job 87c61e8ee64cdbd50f191d39610eb58f from resource manager with leader id
> 8e06aa64d5f8961809da38fe7f224cc1.
> 21:36:49,696 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Allocated slot for AllocationID{f88e958c2c13a27f6ebaca68892c6554}.
> 21:36:49,696 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Add job 87c61e8ee64cdbd50f191d39610eb58f for job leader monitoring.
> 21:36:49,696 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Stopping ZooKeeperLeaderRetrievalService
> /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,696 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Starting ZooKeeperLeaderRetrievalService
> /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,698 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Receive slot request AllocationID{229f7519d895335cff7b577364d3f034} for
> job 87c61e8ee64cdbd50f191d39610eb58f from resource manager with leader id
> 8e06aa64d5f8961809da38fe7f224cc1.
> 21:36:49,698 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Allocated slot for AllocationID{229f7519d895335cff7b577364d3f034}.
> 21:36:49,698 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Add job 87c61e8ee64cdbd50f191d39610eb58f for job leader monitoring.
> 21:36:49,698 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Stopping ZooKeeperLeaderRetrievalService
> /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,698 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Starting ZooKeeperLeaderRetrievalService
> /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,699 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Resolved JobManager address, beginning registration
> 21:36:49,699 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Receive slot request AllocationID{98341da2fd62db5e0a775dd9196a522e} for
> job 87c61e8ee64cdbd50f191d39610eb58f from resource manager with leader id
> 8e06aa64d5f8961809da38fe7f224cc1.
> 21:36:49,700 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Allocated slot for AllocationID{98341da2fd62db5e0a775dd9196a522e}.
> 21:36:49,700 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Add job 87c61e8ee64cdbd50f191d39610eb58f for job leader monitoring.
> 21:36:49,700 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Stopping ZooKeeperLeaderRetrievalService
> /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,700 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Starting ZooKeeperLeaderRetrievalService
> /leader/87c61e8ee64cdbd50f191d39610eb58f/job_manager_lock.
> 21:36:49,703 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Try to register at job manager
> akka.tcp://[email protected]:6123/user/jobmanager_3 with leader id
> f85f6f9b-7713-4be3-a8f0-8443d91e5e6d.
> 21:36:49,706 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Resolved JobManager address, beginning registration
> 21:36:49,706 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Registration at JobManager attempt 1 (timeout=100ms)
> 21:36:49,708 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService
> - Successful registration at job manager
> akka.tcp://[email protected]:6123/user/jobmanager_3 for job
> 87c61e8ee64cdbd50f191d39610eb58f.
> 21:36:49,709 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Establish JobManager connection for job 87c61e8ee64cdbd50f191d39610eb58f.
> 21:36:49,712 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
> - Offer reserved slots to the leader of job
> 87c61e8ee64cdbd50f191d39610eb58f.
> 21:36:49,713 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable
> - Activate slot AllocationID{229f7519d895335cff7b577364d3f034}.
> 21:36:49,713 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable
> - Activate slot AllocationID{98341da2fd62db5e0a775dd9196a522e}.
> 21:36:49,713 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable
> - Activate slot AllocationID{f88e958c2c13a27f6ebaca68892c6554}.
> 21:36:49,713 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable
> - Activate slot AllocationID{0521fab5d106362671db3b18031685a7}.
> 21:36:49,713 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable
> - Activate slot AllocationID{4274d96a59d370305520876f5b84fb9f}.
> 21:36:49,713 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable
> - Activate slot AllocationID{3a64e2c8c5b22adbcfd3ffcd2b49e7f9}.
> 21:36:49,713 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable
> - Activate slot AllocationID{2ca95a9b9ccbd23d235b338d9aff7e56}.
> 21:36:49,713 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable
> - Activate slot AllocationID{740caf20a5f7f767864122dc9a7444d9}.
> {noformat}
> Alas, the job failed to restore, timing out. The JM logs show it requested 8
> new slots, but some of those requests timed out, causing the restore to fail:
> {noformat}
> 21:36:49,716 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool
> - Requesting new slot [SlotRequestId{4e63f5ba519d83764a2e06611285d930}]
> and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource
> manager.
> 21:36:49,716 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool
> - Requesting new slot [SlotRequestId{7fa61d06d579e3ac55456b46e7f6333e}]
> and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource
> manager.
> 21:36:49,716 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request
> slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job
> 87c61e8ee64cdbd50f191d39610eb58f with allocation id
> AllocationID{f5b48cd43ba142fe90f73acc7e69ae76}.
> 21:36:49,716 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request
> slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job
> 87c61e8ee64cdbd50f191d39610eb58f with allocation id
> AllocationID{a12f5de011daeb570b9afacf7d3241ab}.
> 21:36:49,716 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool
> - Requesting new slot [SlotRequestId{a7197d88984291a7b89beda98ae351d4}]
> and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource
> manager.
> 21:36:49,717 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request
> slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job
> 87c61e8ee64cdbd50f191d39610eb58f with allocation id
> AllocationID{352b4ea7d7bfe4f4910f5c40c96d1684}.
> 21:36:49,717 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool
> - Requesting new slot [SlotRequestId{c88a96628d5a13e5ee14371f62f45866}]
> and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource
> manager.
> 21:36:49,717 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool
> - Requesting new slot [SlotRequestId{e81eb2b9ef6bf9b5a6f2299b69328b80}]
> and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource
> manager.
> 21:36:49,717 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool
> - Requesting new slot [SlotRequestId{dfa16526ccec6297ba6587d9fbd60993}]
> and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource
> manager.
> 21:36:49,717 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool
> - Requesting new slot [SlotRequestId{10ad63aceecbb72a709f57b3a6f13437}]
> and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource
> manager.
> 21:36:49,719 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool
> - Requesting new slot [SlotRequestId{8462b3a4890330f261ab41208e863d00}]
> and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource
> manager.
> 21:36:49,717 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request
> slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job
> 87c61e8ee64cdbd50f191d39610eb58f with allocation id
> AllocationID{eff7d3b200a6c225fb3c49ab5d5fc5b4}.
> 21:36:49,719 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request
> slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job
> 87c61e8ee64cdbd50f191d39610eb58f with allocation id
> AllocationID{54dea082f27a4c6848fd539292c78e83}.
> 21:36:49,719 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request
> slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job
> 87c61e8ee64cdbd50f191d39610eb58f with allocation id
> AllocationID{e69fb15ef711f56b7582f8e507f30af2}.
> 21:36:49,719 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request
> slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job
> 87c61e8ee64cdbd50f191d39610eb58f with allocation id
> AllocationID{8d2711dd73157f929263e08db873334f}.
> 21:36:49,719 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request
> slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job
> 87c61e8ee64cdbd50f191d39610eb58f with allocation id
> AllocationID{2480907777440beb3accbb559b060a3c}.
> 21:41:49,716 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool
> - Pending slot request [SlotRequestId{4e63f5ba519d83764a2e06611285d930}]
> timed out.
> 21:41:49,717 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool
> - Pending slot request [SlotRequestId{7fa61d06d579e3ac55456b46e7f6333e}]
> timed out.
> 21:41:49,717 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool
> - Pending slot request [SlotRequestId{a7197d88984291a7b89beda98ae351d4}]
> timed out.
> 21:41:49,719 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph
> - Job Foo (87c61e8ee64cdbd50f191d39610eb58f) switched from state RUNNING
> to FAILING.
> Could not allocate all requires slots within timeout of 300000 ms. Slots
> required: 384, slots allocated: 369
> org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$3(ExecutionGraph.java:984)
> java.util.concurrent.CompletableFuture.uniExceptionally(Unknown Source)
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(Unknown
> Source)
> java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
> java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
> org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:534)
> java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
> java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
> java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
> akka.dispatch.OnComplete.internal(Future.scala:258)
> akka.dispatch.OnComplete.internal(Future.scala:256)
> akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:20)
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:18)
> scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
> scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
> akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
> akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 21:41:49,726 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool
> - Pending slot request [SlotRequestId{c88a96628d5a13e5ee14371f62f45866}]
> timed out.
> 21:41:49,726 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool
> - Pending slot request [SlotRequestId{e81eb2b9ef6bf9b5a6f2299b69328b80}]
> timed out.
> 21:41:49,726 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool
> - Pending slot request [SlotRequestId{dfa16526ccec6297ba6587d9fbd60993}]
> timed out.
> 21:41:49,726 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool
> - Pending slot request [SlotRequestId{10ad63aceecbb72a709f57b3a6f13437}]
> timed out.
> 21:41:49,726 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool
> - Pending slot request [SlotRequestId{8462b3a4890330f261ab41208e863d00}]
> timed out.
> {noformat}
>
> This repeats itself until the job is canceled.
>