Github user mxm commented on the issue:

    https://github.com/apache/flink/pull/2571
  
    Great to hear that we're on the same page :)
    
    >I think there's no need to stick to the failed slot when the allocation fails by RPC. Just put it back into the free pool, and give us another shot.
    
    Yes, we can simply trigger processing of pending requests via `handleFreeSlot`.
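
A minimal sketch of that flow, under stated assumptions: `SlotManagerSketch`, `requestSlot`, `handleAllocationFailure`, and the string ids are hypothetical names for illustration, not Flink's actual classes or signatures. Only the name `handleFreeSlot` comes from the discussion, and its behavior here is assumed. When the allocation RPC fails, the request is re-queued and the slot goes back through `handleFreeSlot`, which serves the oldest pending request.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model; not Flink's actual SlotManager. */
final class SlotManagerSketch {
    // Requests that have not been matched to a slot yet.
    private final Deque<String> pendingRequests = new ArrayDeque<>();
    // Slots currently believed to be free.
    private final Deque<String> freeSlots = new ArrayDeque<>();
    // slotId -> requestId for allocations sent to a TaskExecutor.
    private final Map<String, String> allocations = new HashMap<>();

    /** Assign a free slot if one exists, otherwise queue the request. */
    void requestSlot(String requestId) {
        String slotId = freeSlots.poll();
        if (slotId == null) {
            pendingRequests.add(requestId);
        } else {
            allocations.put(slotId, requestId);
        }
    }

    /** A slot became free: serve the oldest pending request, if any. */
    void handleFreeSlot(String slotId) {
        String requestId = pendingRequests.poll();
        if (requestId == null) {
            freeSlots.add(slotId);
        } else {
            allocations.put(slotId, requestId);
        }
    }

    /** The allocation RPC failed: do not stick to the slot. */
    void handleAllocationFailure(String slotId) {
        String requestId = allocations.remove(slotId);
        if (requestId != null) {
            pendingRequests.add(requestId); // the request gets another shot later
        }
        handleFreeSlot(slotId); // the slot returns to the pool and pending requests are processed
    }

    String allocationOf(String slotId) {
        return allocations.get(slotId);
    }

    int pendingCount() {
        return pendingRequests.size();
    }

    int freeCount() {
        return freeSlots.size();
    }
}
```

In this sketch no separate list of unconfirmed requests is kept; a failed allocation simply turns back into a pending request plus a free slot.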
    
    >Actually, I think the pending requests act like your extra list of unconfirmed requests. (And as you pointed out at the end, we actually don't need this list, as the TaskManager will correct our faults by rejecting the allocation.)
    
    I think PendingRequests is not the same, because it is a list of outstanding requests, not of requests that have already been issued to TaskExecutors. But as we found out, we don't need a special list for that on the ResourceManager side.
    
    >Yes, I also thought this might be a solution. And I think this can work with the Heartbeat manager, since if you cannot send the free message to the RM, you will not be able to send heartbeats either. After some timeout, the RM will treat the TaskManager as dead, and some garbage collection logic in the RM will take care of all the allocations and slots which belong to this TaskManager.
    
    Are you saying you would rather let the HeartbeatManager send out the removal of slots? That would work, but depending on the heartbeat interval this could take slightly longer. Semantically, it doesn't make much difference.
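
To make the heartbeat-based cleanup concrete, here is a rough sketch, assuming a fixed timeout and explicit timestamps. `HeartbeatSketch` and all of its methods are hypothetical and do not reflect Flink's HeartbeatManager API. A TaskManager whose last heartbeat is older than the timeout is treated as dead, and every slot it owned is dropped in the same pass.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of heartbeat-driven slot cleanup; not Flink's HeartbeatManager. */
final class HeartbeatSketch {
    private final long timeoutMillis;
    // taskManagerId -> timestamp of the last heartbeat received
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    // slotId -> owning taskManagerId
    private final Map<String, String> slotOwner = new HashMap<>();

    HeartbeatSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void registerSlot(String taskManagerId, String slotId, long now) {
        lastHeartbeat.putIfAbsent(taskManagerId, now);
        slotOwner.put(slotId, taskManagerId);
    }

    void heartbeat(String taskManagerId, long now) {
        lastHeartbeat.put(taskManagerId, now);
    }

    /** Remove TaskManagers that timed out, together with all of their slots. */
    List<String> expire(long now) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeat.entrySet()) {
            if (now - entry.getValue() > timeoutMillis) {
                dead.add(entry.getKey());
            }
        }
        lastHeartbeat.keySet().removeAll(dead);
        slotOwner.values().removeAll(dead); // drops every slot owned by a dead TaskManager
        return dead;
    }

    int slotCount() {
        return slotOwner.size();
    }
}
```

The delay mentioned above shows up directly: a slot is only reclaimed once `expire` runs after the timeout has elapsed, rather than as soon as the free message would have arrived.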

