[ 
https://issues.apache.org/jira/browse/FLINK-27236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527508#comment-17527508
 ] 

yanpengshi commented on FLINK-27236:
------------------------------------

[~wanglijie95], thank you for your reply. With your help, I now understand the 
cause of the problem. I have another, related question:

1: The JM (or its SlotPool) is notified of a free slot only when the 
TaskExecutor has no allocated slots left for the job, via 
TaskExecutor::closeJobManagerConnectionIfNoAllocatedResources.

2: The RM, however, is notified of the free slot immediately, via 
ResourceManagerGateway::notifySlotAvailable.

 

I am not sure whether I understand this correctly. If I do, can you tell me the 
reason for this difference?
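
A minimal sketch of the two notification paths described above, written as a toy model and not as Flink's actual code. The class SlotFreeSketch, its fields, and the event strings are hypothetical and exist only for illustration; only the names notifySlotAvailable and closeJobManagerConnectionIfNoAllocatedResources come from the discussion.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the two notification paths; NOT Flink's implementation. */
class SlotFreeSketch {
    /** Notifications emitted by the toy task executor, in order. */
    final List<String> events = new ArrayList<>();

    /** Hypothetical bookkeeping: allocation ID -> owning job ID. */
    private final Map<String, String> slotOwner = new HashMap<>();

    /** Records that the given allocation is held by the given job. */
    void allocate(String allocationId, String jobId) {
        slotOwner.put(allocationId, jobId);
    }

    /** Frees one slot and records who is told about it. */
    void freeSlot(String allocationId) {
        String jobId = slotOwner.remove(allocationId);
        if (jobId == null) {
            return; // unknown allocation, nothing to free
        }
        // Path 2: the RM is told about every freed slot right away.
        events.add("RM.notifySlotAvailable(" + allocationId + ")");
        // Path 1: the JM side is touched only once the job holds no
        // more slots on this task executor.
        if (!slotOwner.containsValue(jobId)) {
            events.add("closeJobManagerConnection(" + jobId + ")");
        }
    }

    public static void main(String[] args) {
        SlotFreeSketch te = new SlotFreeSketch();
        te.allocate("a1", "job");
        te.allocate("a2", "job");
        te.freeSlot("a1"); // RM notified; job still holds a2
        te.freeSlot("a2"); // RM notified; JM connection closed
        te.events.forEach(System.out::println);
    }
}
```

In this model, freeing the first slot produces only an RM notification, while the JM-side event appears only after the job's last slot on the executor is freed. That asymmetry is what the question is about.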

> No task slot allocated for job in large-scale job
> -------------------------------------------------
>
>                 Key: FLINK-27236
>                 URL: https://issues.apache.org/jira/browse/FLINK-27236
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.13.3
>            Reporter: yanpengshi
>            Priority: Major
>         Attachments: jobmanager.log.26, taskmanager.log, topology.png
>
>   Original Estimate: 444h
>  Remaining Estimate: 444h
>
> Hey,
>  
> We run a large-scale Flink job with six vertices and a parallelism of 3000. 
> The topology is shown below.
> !topology.png!
> We see the following exception in jobmanager.log: [^jobmanager.log.26]
> {code:java}
> 2022-03-02 08:01:16,601 INFO  [1998] 
> [org.apache.flink.runtime.executiongraph.Execution.transitionState(Execution.java:1446)]
>   - Source: tdbank_exposure_wx -> Flat Map (772/3000) 
> (6cd18d4ead1887a4e19fd3f337a6f4f8) switched from DEPLOYING to FAILED on 
> container_e03_1639558254334_10048_01_004716 @ 11.104.77.40 
> (dataPort=39313).java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.taskexecutor.exceptions.TaskSubmissionException: No 
> task slot allocated for job ID 000000000000ed780000000000000087 and 
> allocation ID beb058d837c09e8d5a4a6aaf2426ca99. {code}
>  
> In taskmanager.log [^taskmanager.log], the slot is freed due to a timeout, 
> and the TaskManager then receives a new allocation request. Increasing the 
> value of taskmanager.slot.timeout avoids this exception 
> temporarily.
> Here are our guesses:
>  # When the job is scheduled, the slot and the execution are bound to each 
> other, and the task is then deployed to the corresponding TaskManager.
>  # The slot is released after the idle timeout expires, and the 
> ResourceManager is notified that the slot is free. The ResourceManager then 
> assigns another request to the slot.
>  # The task is deployed to the TaskManager according to the earlier binding.
>  
> The key problems are:
>  # When the slot is freed, the execution is not unassigned from the slot;
>  # The slot state is inconsistent between the JobMaster and the ResourceManager.
>  
> Has anyone else encountered this problem? When the slot is freed, how can we 
> unassign the previously bound execution? Or do we need to update the resource 
> address of the execution? @[~zhuzh] @[~wanglijie95] 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
