[
https://issues.apache.org/jira/browse/FLINK-27236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527508#comment-17527508
]
yanpengshi commented on FLINK-27236:
------------------------------------
[~wanglijie95], thank you for your reply. With your help, I now understand the
cause of the problem. I have another related question:
1: The JM (SlotPool) only learns that a slot has been freed once the
TaskExecutor has no allocated slots left for the job, via
TaskExecutor::closeJobManagerConnectionIfNoAllocatedResources.
2: However, the RM is notified of the free slot immediately, via
ResourceManagerGateway::notifySlotAvailable.
I am not sure whether I understand this correctly. Can you tell me the reason
for this difference?
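To make the two points concrete, here is a minimal sketch of the behavior as I
understand it. It is a toy model, not Flink code: ToyTaskExecutor, its methods,
and the event strings are all hypothetical names invented for illustration, and
it assumes points 1 and 2 above are correct.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: none of these types are Flink classes. */
public class SlotNotificationSketch {

    /** Hypothetical stand-in for a TaskExecutor tracking allocated slots. */
    static class ToyTaskExecutor {
        // allocationId -> jobId for every slot currently allocated
        private final Map<String, String> allocatedSlots = new HashMap<>();
        // Messages each side would have received, in order
        final List<String> resourceManagerEvents = new ArrayList<>();
        final List<String> jobManagerEvents = new ArrayList<>();

        void allocateSlot(String allocationId, String jobId) {
            allocatedSlots.put(allocationId, jobId);
        }

        /** Frees one slot, e.g. after its idle timeout has fired. */
        void freeSlot(String allocationId) {
            String jobId = allocatedSlots.remove(allocationId);
            if (jobId == null) {
                return; // unknown or already freed
            }
            // Point 2 (assumed): the RM is told about the free slot right away.
            resourceManagerEvents.add("slotAvailable:" + allocationId);
            // Point 1 (assumed): the JM side only hears something once the job
            // has no allocated slots left on this executor.
            if (!allocatedSlots.containsValue(jobId)) {
                jobManagerEvents.add("connectionClosed:" + jobId);
            }
        }

        /** Accepts a task only if the slot is still allocated to that job. */
        boolean submitTask(String jobId, String allocationId) {
            return jobId.equals(allocatedSlots.get(allocationId));
        }
    }

    public static void main(String[] args) {
        ToyTaskExecutor te = new ToyTaskExecutor();
        te.allocateSlot("alloc-1", "job-A");
        te.allocateSlot("alloc-2", "job-A");

        // One of the two slots times out and is freed.
        te.freeSlot("alloc-1");

        System.out.println("RM events: " + te.resourceManagerEvents);
        System.out.println("JM events: " + te.jobManagerEvents);
        System.out.println("submit on alloc-1 accepted: "
                + te.submitTask("job-A", "alloc-1"));
    }
}
```

In this toy model, after alloc-1 is freed the RM side has one event while the
JM side has none, so a later submission that still uses alloc-1 is rejected.
That is the kind of inconsistency I am asking about.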
> No task slot allocated for job in larege-scale job
> --------------------------------------------------
>
> Key: FLINK-27236
> URL: https://issues.apache.org/jira/browse/FLINK-27236
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.13.3
> Reporter: yanpengshi
> Priority: Major
> Attachments: jobmanager.log.26, taskmanager.log, topology.png
>
> Original Estimate: 444h
> Remaining Estimate: 444h
>
> Hey,
>
> We run a large-scale Flink job containing six vertices with a parallelism
> of 3k. The topology is shown below.
> !topology.png!
> We hit the following exception in jobmanager.log: [^jobmanager.log.26]
> {code:java}
> 2022-03-02 08:01:16,601 INFO [1998]
> [org.apache.flink.runtime.executiongraph.Execution.transitionState(Execution.java:1446)]
> - Source: tdbank_exposure_wx -> Flat Map (772/3000)
> (6cd18d4ead1887a4e19fd3f337a6f4f8) switched from DEPLOYING to FAILED on
> container_e03_1639558254334_10048_01_004716 @ 11.104.77.40
> (dataPort=39313).java.util.concurrent.CompletionException:
> org.apache.flink.runtime.taskexecutor.exceptions.TaskSubmissionException: No
> task slot allocated for job ID 000000000000ed780000000000000087 and
> allocation ID beb058d837c09e8d5a4a6aaf2426ca99. {code}
>
> In taskmanager.log [^taskmanager.log], the slot is freed due to a timeout
> and the TaskManager receives a new allocation request. By increasing the
> value of taskmanager.slot.timeout, we can avoid this exception
> temporarily.
> Here are some of our guesses:
> # When the job is scheduled, the slot and the execution are bound, and
> then the task is deployed to the corresponding TaskManager.
> # The slot is released after the idle timeout expires, and the
> ResourceManager is notified that the slot is free. The ResourceManager
> then assigns another request to the slot.
> # The task is deployed to the TaskManager according to the previous binding.
>
> The key problems are:
> # When the slot is freed, the execution is not unassigned from the slot;
> # The slot state is not consistent between the JobMaster and the
> ResourceManager.
>
> Has anyone else encountered this problem? When the slot is freed, how can we
> unassign the previously bound execution? Or do we need to update the
> resource address of the execution? @[~zhuzh] @[~wanglijie95]