Till Rohrmann created FLINK-25855:
-------------------------------------

             Summary: DefaultDeclarativeSlotPool rejects offered slots when the 
job is restarting
                 Key: FLINK-25855
                 URL: https://issues.apache.org/jira/browse/FLINK-25855
             Project: Flink
          Issue Type: Sub-task
          Components: Runtime / Coordination
    Affects Versions: 1.14.3, 1.15.0
            Reporter: Till Rohrmann


The {{DefaultDeclarativeSlotPool}} rejects offered slots if the job is 
currently restarting. The problem is that in case of a job restart, the 
scheduler sets the required resources to zero. Hence, all offered slots will be 
rejected.
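
To illustrate the mechanism, here is a minimal sketch of a pool that only accepts slots while its requirements are unfulfilled. The class {{StrictSlotPool}} and its methods are hypothetical simplifications, not the actual {{DefaultDeclarativeSlotPool}} API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of a slot pool; not Flink's actual implementation. */
final class StrictSlotPool {
    private int requiredSlots;
    private final List<String> acceptedSlots = new ArrayList<>();

    /** Assumption: on a job restart, the scheduler calls this with 0. */
    void setRequiredSlots(int requiredSlots) {
        this.requiredSlots = requiredSlots;
    }

    /** Accepts offered slots only while requirements are unfulfilled; returns the rejected ones. */
    List<String> offerSlots(List<String> offeredSlotIds) {
        List<String> rejected = new ArrayList<>();
        for (String slotId : offeredSlotIds) {
            if (acceptedSlots.size() < requiredSlots) {
                acceptedSlots.add(slotId);
            } else {
                // Nothing is required right now, so the offer is rejected.
                rejected.add(slotId);
            }
        }
        return rejected;
    }
}
```

With the requirements at zero, every slot passed to {{offerSlots}} ends up in the rejected list.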

This is a problem for local recovery because rejected slots will be freed by 
the {{TaskExecutor}} and thereby all local state will be deleted. Hence, in 
order to properly support local recovery, we need to prevent these slots from 
being freed while the job restarts. I see the following options:

h3. Accept excess slots
Accepting excess slots means that the {{DefaultDeclarativeSlotPool}} accepts 
slots which exceed the currently required set of slots. 

Advantages: 
* Easy to implement

Disadvantages:
* Offered slots that are not really needed will only be freed after the idle 
slot timeout. This means that some resources might be left unused for some time.
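
A rough sketch of this option, assuming a pool that accepts every offer and later releases slots that stayed unused past an idle timeout. The class {{ExcessAcceptingSlotPool}}, its methods, and the explicit timestamps are hypothetical and only model the behaviour described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical model: accept every offered slot, free unused ones after an idle timeout. */
final class ExcessAcceptingSlotPool {
    private final long idleTimeoutMillis;
    // Slot id -> time at which the slot became idle.
    private final Map<String, Long> idleSince = new HashMap<>();

    ExcessAcceptingSlotPool(long idleTimeoutMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    /** Every offer is accepted, even if it exceeds the current requirements. */
    void offerSlot(String slotId, long nowMillis) {
        idleSince.put(slotId, nowMillis);
    }

    /** Marks an idle slot as in use; returns false if the slot is not idle in this pool. */
    boolean reserve(String slotId) {
        return idleSince.remove(slotId) != null;
    }

    /** Releases and returns all slots that have been idle for at least the timeout. */
    List<String> releaseIdleSlots(long nowMillis) {
        List<String> released = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = idleSince.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= idleTimeoutMillis) {
                released.add(entry.getKey());
                it.remove();
            }
        }
        return released;
    }
}
```

The disadvantage shows up directly: an unneeded slot stays in the pool until {{releaseIdleSlots}} is called after the timeout has elapsed.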

h3. Let DefaultDeclarativeSlotPool accept excess slots when job is restarting
Here the idea is to only accept excess slots when the job is currently 
restarting. This will require that the scheduler tells the 
{{DefaultDeclarativeSlotPool}} about the restarting state.

Advantages:
* We would only accept excess slots while the job is restarting

Disadvantages:
* We are complicating the semantics of the {{DefaultDeclarativeSlotPool}}. 
Moreover, we are introducing additional signals that communicate the restarting 
state to the pool.
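
As a sketch of what the extra signal could look like, assume the scheduler toggles a flag on the pool. The class {{RestartAwareSlotPool}} and the method {{setRestarting}} are hypothetical names, not existing Flink API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model: excess slots are accepted only while the job is restarting. */
final class RestartAwareSlotPool {
    private int requiredSlots;
    private boolean restarting;
    private final List<String> acceptedSlots = new ArrayList<>();

    void setRequiredSlots(int requiredSlots) {
        this.requiredSlots = requiredSlots;
    }

    /** Assumption: the scheduler calls this when it enters or leaves the restarting state. */
    void setRestarting(boolean restarting) {
        this.restarting = restarting;
    }

    /** Returns true if the slot was accepted. */
    boolean offerSlot(String slotId) {
        boolean needed = acceptedSlots.size() < requiredSlots;
        if (needed || restarting) {
            acceptedSlots.add(slotId);
            return true;
        }
        return false;
    }
}
```

The additional flag is the complication mentioned above: the pool's acceptance decision now depends on scheduler state as well as on the declared requirements.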


h3. Don't immediately free slots on the TaskExecutor when they are rejected
Instead of freeing the slot immediately on the {{TaskExecutor}} after it is 
rejected, we could retry the offer for some time and only free the slot after 
a timeout.

Advantages:
* No changes on the JobMaster side needed.

Disadvantages:
* Complication of the slot lifecycle on the {{TaskExecutor}}
* Unneeded slots are not made available for other jobs as fast as possible
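
One way this could be modelled on the {{TaskExecutor}} side is a small tracker that remembers when a slot was first rejected and decides between retrying the offer and freeing the slot. The class {{RejectedSlotTracker}}, the {{Action}} enum, and the timeout parameter are hypothetical; this is a sketch, not how the {{TaskExecutor}} is implemented.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model: keep retrying rejected slot offers until a timeout expires. */
final class RejectedSlotTracker {
    enum Action { RETRY_OFFER, FREE_SLOT }

    private final long retryTimeoutMillis;
    // Slot id -> time of the first rejection in the current retry period.
    private final Map<String, Long> firstRejection = new HashMap<>();

    RejectedSlotTracker(long retryTimeoutMillis) {
        this.retryTimeoutMillis = retryTimeoutMillis;
    }

    /** Called whenever an offer for the slot is rejected; decides what to do next. */
    Action onRejected(String slotId, long nowMillis) {
        long first = firstRejection.computeIfAbsent(slotId, id -> nowMillis);
        if (nowMillis - first >= retryTimeoutMillis) {
            // The retry period is over: free the slot (and with it the local state).
            firstRejection.remove(slotId);
            return Action.FREE_SLOT;
        }
        return Action.RETRY_OFFER;
    }

    /** Called when an offer is finally accepted; ends the retry period for the slot. */
    void onAccepted(String slotId) {
        firstRejection.remove(slotId);
    }
}
```

The extra per-slot bookkeeping is the lifecycle complication listed above, and a slot that is never accepted stays blocked for the whole retry period.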



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
