[
https://issues.apache.org/jira/browse/FLINK-25855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483143#comment-17483143
]
Chesnay Schepler commented on FLINK-25855:
------------------------------------------
But it has issues with how to handle timeouts, and could easily cause the JM to
accept slot offers but this signal not being returned to the TM.
> DefaultDeclarativeSlotPool rejects offered slots when the job is restarting
> ---------------------------------------------------------------------------
>
> Key: FLINK-25855
> URL: https://issues.apache.org/jira/browse/FLINK-25855
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Coordination
> Affects Versions: 1.15.0, 1.14.3
> Reporter: Till Rohrmann
> Priority: Major
>
> The {{DefaultDeclarativeSlotPool}} rejects offered slots if the job is
> currently restarting. The problem is that in case of a job restart, the
> scheduler sets the required resources to zero. Hence, all offered slots will
> be rejected.
> This is a problem for local recovery because rejected slots will be freed by
> the {{TaskExecutor}} and thereby all local state will be deleted. Hence, in
> order to properly support local recovery, we need to handle this situation
> somehow. I do see different options here:
> This problem only affects the {{DefaultScheduler}} since the
> {{AdaptiveScheduler}} sets the required resources when transitioning into the
> {{WaitingForResources}} state.
> h4. Accept excess slots
> Accepting excess slots means that the {{DefaultDeclarativeSlotPool}} accepts
> slots which exceed the currently required set of slots.
> Advantages:
> * Easy to implement
> Disadvantages:
> * Offered slots that are not really needed will only be freed after the idle
> slot timeout. This means that some resources might be left unused for some
> time.
> h4. Let DefaultDeclarativeSlotPool accept excess slots only when job is
> restarting
> Here the idea is to only accept excess slots when the job is currently
> restarting. This will required that the scheduler tells the
> {{DefaultDeclarativeSlotPool}} about the restarting state.
> Advantages:
> * We would only accept excess slots for the time of restarting
> Disadvantages:
> * We are complicating the semantics of the {{DefaultDeclarativeSlotPool}}.
> Moreover, we are introducing additional signals that communicate the
> restarting state to the pool.
> h4. Don't immediately free slots on the TaskExecutor when they are rejected
> Instead of freeing the slot immediately on the {{TaskExecutor}} after it is
> rejected. We could also retry for some time and only free the slot after some
> timeout.
> Advantages:
> * No changes on the JobMaster side needed.
> Disadvantages:
> * Complication of the slot lifecycle on the {{TaskExecutor}}
> * Unneeded slots are not made available for other jobs as fast as possible
> h4. Don't zero resource requirements during job restart
> Instead of zeroing the resource requirements during a job restart, we could
> also keep the last know requirements. Once the job is restarted, we could
> adjust the requirements.
> Advantages:
> * Conceptually easy to do
> Disadvantages:
> * The old requirements mustn't necessarily be the new ones
> * Convolutes logic in the scheduler
--
This message was sent by Atlassian Jira
(v8.20.1#820001)