[jira] [Commented] (FLINK-25855) DefaultDeclarativeSlotPool rejects offered slots when the job is restarting

Chesnay Schepler (Jira) Thu, 27 Jan 2022 06:00:17 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-25855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483143#comment-17483143
 ]


Chesnay Schepler commented on FLINK-25855:
------------------------------------------

But it has issues with how to handle timeouts, and it could easily lead to the 
JM accepting slot offers without that acceptance ever reaching the TM.

> DefaultDeclarativeSlotPool rejects offered slots when the job is restarting
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-25855
>                 URL: https://issues.apache.org/jira/browse/FLINK-25855
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.14.3
>            Reporter: Till Rohrmann
>            Priority: Major
>
> The {{DefaultDeclarativeSlotPool}} rejects offered slots if the job is 
> currently restarting. The problem is that in case of a job restart, the 
> scheduler sets the required resources to zero. Hence, all offered slots will 
> be rejected.
> This is a problem for local recovery because rejected slots will be freed by 
> the {{TaskExecutor}}, and thereby all local state will be deleted.
> This problem only affects the {{DefaultScheduler}} since the 
> {{AdaptiveScheduler}} sets the required resources when transitioning into the 
> {{WaitingForResources}} state.
> Hence, in order to properly support local recovery, we need to handle this 
> situation somehow. I see the following options:
> h4. Accept excess slots
> Accepting excess slots means that the {{DefaultDeclarativeSlotPool}} accepts 
> slots which exceed the currently required set of slots. 
> Advantages: 
> * Easy to implement
> Disadvantages:
> * Offered slots that are not really needed will only be freed after the idle 
> slot timeout. This means that some resources might be left unused for some 
> time.
> h4. Let DefaultDeclarativeSlotPool accept excess slots only when job is restarting
> Here the idea is to only accept excess slots when the job is currently 
> restarting. This will require that the scheduler tells the 
> {{DefaultDeclarativeSlotPool}} about the restarting state.
> Advantages:
> * We would only accept excess slots for the time of restarting
> Disadvantages:
> * We are complicating the semantics of the {{DefaultDeclarativeSlotPool}}. 
> Moreover, we are introducing additional signals that communicate the 
> restarting state to the pool.
> h4. Don't immediately free slots on the TaskExecutor when they are rejected
> Instead of freeing the slot immediately on the {{TaskExecutor}} after it is 
> rejected, we could retry the offer for some time and only free the slot after 
> a timeout.
> Advantages:
> * No changes on the JobMaster side needed.
> Disadvantages:
> * Complication of the slot lifecycle on the {{TaskExecutor}}
> * Unneeded slots are not made available for other jobs as fast as possible
> h4. Don't zero resource requirements during job restart
> Instead of zeroing the resource requirements during a job restart, we could 
> also keep the last known requirements. Once the job is restarted, we could 
> adjust the requirements.
> Advantages:
> * Conceptually easy to do
> Disadvantages:
> * The old requirements are not necessarily the same as the new ones
> * Complicates the logic in the scheduler
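
To make the difference between the first two options concrete, here is a
minimal Java sketch of a toy pool that only counts slots. It is an illustration
under stated assumptions, not Flink code: ToySlotPool, ExcessPolicy,
setRequiredSlots, setRestarting and offerSlots are all hypothetical names, and
the real DefaultDeclarativeSlotPool is considerably more involved. The sketch
only shows how zeroed requirements lead to rejection, and how each acceptance
policy changes that.

```java
// Toy model only: all names here are hypothetical and do not mirror Flink's classes.
enum ExcessPolicy { REJECT_EXCESS, ACCEPT_EXCESS, ACCEPT_EXCESS_WHILE_RESTARTING }

final class ToySlotPool {
    private final ExcessPolicy policy;
    private int requiredSlots;  // what the scheduler currently declares
    private int acceptedSlots;  // slots the pool currently holds
    private boolean restarting; // only consulted by ACCEPT_EXCESS_WHILE_RESTARTING

    ToySlotPool(ExcessPolicy policy) {
        this.policy = policy;
    }

    void setRequiredSlots(int requiredSlots) {
        this.requiredSlots = requiredSlots;
    }

    void setRestarting(boolean restarting) {
        this.restarting = restarting;
    }

    /** Returns how many of the offered slots are accepted; the rest are rejected. */
    int offerSlots(int offered) {
        int missing = Math.max(0, requiredSlots - acceptedSlots);
        boolean acceptExcess =
                policy == ExcessPolicy.ACCEPT_EXCESS
                        || (policy == ExcessPolicy.ACCEPT_EXCESS_WHILE_RESTARTING && restarting);
        int accepted = acceptExcess ? offered : Math.min(offered, missing);
        acceptedSlots += accepted;
        return accepted;
    }
}

final class SlotOfferDemo {
    public static void main(String[] args) {
        for (ExcessPolicy policy : ExcessPolicy.values()) {
            ToySlotPool pool = new ToySlotPool(policy);
            pool.setRequiredSlots(0); // the scheduler zeroes the requirements on restart
            pool.setRestarting(true);
            int duringRestart = pool.offerSlots(2);
            pool.setRestarting(false); // requirements stay at zero here on purpose
            int afterRestart = pool.offerSlots(2);
            System.out.println(policy + ": accepted " + duringRestart
                    + " during restart, " + afterRestart + " afterwards");
        }
    }
}
```

In this toy model, REJECT_EXCESS corresponds to the behaviour described in the
issue (everything is rejected while the requirements are zero), ACCEPT_EXCESS
to the first option, and ACCEPT_EXCESS_WHILE_RESTARTING to the second, which
needs the extra restarting signal from the scheduler.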



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
