[ 
https://issues.apache.org/jira/browse/FLINK-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235847#comment-16235847
 ] 

ASF GitHub Bot commented on FLINK-7870:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4887#discussion_r148551966
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/SlotManager.java ---
    @@ -302,7 +302,12 @@ public boolean unregisterSlotRequest(AllocationID allocationId) {
                PendingSlotRequest pendingSlotRequest = pendingSlotRequests.remove(allocationId);
     
                if (null != pendingSlotRequest) {
    -                   cancelPendingSlotRequest(pendingSlotRequest);
    +                   if (pendingSlotRequest.isAssigned()) {
    +                           cancelPendingSlotRequest(pendingSlotRequest);
    +                   }
    +                   else {
    +                           resourceActions.cancelResourceAllocation(pendingSlotRequest.getResourceProfile());
    --- End diff --
    
    I think we should not immediately cancel ongoing resource allocations. The
    `SlotManager` could decide, when a new worker registers, whether that worker
    is actually needed. If it is not, the `SlotManager` could release the
    resource. This would also simplify the protocol, since you cannot know
    whether an ongoing resource allocation can still be cancelled.
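
    The suggestion above could be sketched roughly as follows. This is a
    simplified, hypothetical model and not Flink's actual `SlotManager`: the
    class `LazyReleaseSketch`, the `Decision` enum, and all method names except
    `unregisterSlotRequest` are assumptions made for illustration. Withdrawing a
    request leaves the in-flight allocation untouched, and the keep-or-release
    decision is made only when the worker registers.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical, simplified model; these names are not Flink's actual API. */
final class LazyReleaseSketch {

    /** Outcome of registering a newly started worker. */
    enum Decision { KEEP, RELEASE }

    // Slot requests that have not been served yet, identified by allocation id.
    private final Deque<String> pendingRequests = new ArrayDeque<>();

    // Workers that were requested from the cluster but have not registered yet.
    private int workersInFlight = 0;

    /** A slot request arrives; one more worker is requested from the cluster. */
    void requestSlot(String allocationId) {
        pendingRequests.addLast(allocationId);
        workersInFlight++;
    }

    /**
     * The request is withdrawn. The in-flight worker allocation is deliberately
     * left alone, so no "cancel allocation" call is needed.
     */
    boolean unregisterSlotRequest(String allocationId) {
        return pendingRequests.remove(allocationId);
    }

    /** Decide at registration time whether the new worker is still needed. */
    Decision onWorkerRegistered() {
        workersInFlight--;
        if (pendingRequests.isEmpty()) {
            return Decision.RELEASE; // nobody is waiting: give the resource back
        }
        pendingRequests.removeFirst(); // the worker serves the oldest request
        return Decision.KEEP;
    }

    int workersInFlight() {
        return workersInFlight;
    }
}
```

    With two requests of which one is withdrawn, both workers still start; the
    first to register is kept for the remaining request and the second is
    released on registration.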


> SlotPool should cancel the slot request to RM if not need any more.
> -------------------------------------------------------------------
>
>                 Key: FLINK-7870
>                 URL: https://issues.apache.org/jira/browse/FLINK-7870
>             Project: Flink
>          Issue Type: Bug
>          Components: Cluster Management
>            Reporter: shuai.xu
>            Assignee: shuai.xu
>            Priority: Major
>              Labels: flip-6
>
> 1. SlotPool requests slots from the RM if it does not have enough slots.
> 2. If a slot request is not fulfilled within a certain time, SlotPool treats 
> the request as timed out and sends a new slot request by triggering a failover 
> in JobMaster. The previous request is no longer needed, but the RM does not 
> know that.
> 3. This may cause the RM to request far more resources than the job really needs.
> For example:
> 1. A job needs 100 slots. The RM requests 100 containers from YARN.
> 2. But YARN is busy and has no resources for the job.
> 3. The job fails over because the resource request was not fulfilled in time.
> 4. It asks for 100 slots again, so the RM has now requested 200 containers from YARN.
> 5. If the failover happens several times, the number of requested containers 
> keeps growing.
> 6. Once YARN has resources, it finds that the job appears to need thousands of 
> containers. This is a waste of resources.
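
The growth described in the example can be checked with a small calculation. This is only an illustrative sketch: the class and method names are hypothetical, and it assumes that no request is fulfilled between failovers (YARN stays busy) and that each failover re-requests the full number of slots.

```java
/** Hypothetical arithmetic sketch of the example; not part of Flink. */
final class ContainerRequestGrowth {

    /** Outstanding container requests when stale requests are never cancelled. */
    static int outstandingWithoutCancel(int slotsNeeded, int failovers) {
        int outstanding = 0;
        // The initial attempt plus one re-request per failover, all accumulating.
        for (int attempt = 0; attempt <= failovers; attempt++) {
            outstanding += slotsNeeded;
        }
        return outstanding;
    }

    /** Outstanding container requests when each failover cancels the stale ones. */
    static int outstandingWithCancel(int slotsNeeded, int failovers) {
        int outstanding = 0;
        for (int attempt = 0; attempt <= failovers; attempt++) {
            outstanding = 0;            // stale requests are withdrawn first
            outstanding += slotsNeeded; // then the job asks again
        }
        return outstanding;
    }
}
```

For a job that needs 100 slots, one failover leaves 200 requests outstanding without cancellation, matching step 4 of the example, whereas with cancellation the number stays at 100 no matter how many failovers occur.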



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
