[
https://issues.apache.org/jira/browse/FLINK-25855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483129#comment-17483129
]
Chesnay Schepler commented on FLINK-25855:
------------------------------------------
Another alternative is to delay the processing of slot offers while the job is
not in a created/running state. This could be done in the JobMaster, which
is already informed of job status changes.
That would only touch the JobMaster and should be fairly simple to implement.
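
A minimal sketch of the deferral idea, assuming slots are identified by plain
strings and the slot pool is reduced to a callback. The names
DeferredSlotOffers and SketchJobStatus are hypothetical and not part of Flink;
the real JobMaster would work with its own slot offer and job status types.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical stand-in for the job status; only the states needed here.
enum SketchJobStatus { CREATED, RUNNING, RESTARTING, FAILING }

// Hypothetical helper, not a Flink class: buffers slot offers while the job
// is in a state in which the slot pool would reject them.
final class DeferredSlotOffers {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Consumer<String> slotPool; // receives offers that may be processed
    private SketchJobStatus status = SketchJobStatus.CREATED;

    DeferredSlotOffers(Consumer<String> slotPool) {
        this.slotPool = slotPool;
    }

    private boolean canProcess() {
        return status == SketchJobStatus.CREATED || status == SketchJobStatus.RUNNING;
    }

    // Called when a TaskExecutor offers a slot.
    void offerSlot(String slotId) {
        if (canProcess()) {
            slotPool.accept(slotId);
        } else {
            pending.add(slotId); // keep the offer instead of rejecting it
        }
    }

    // Called on every job status change; drains buffered offers in arrival
    // order once the job is back in a created/running state.
    void onJobStatusChanged(SketchJobStatus newStatus) {
        status = newStatus;
        while (canProcess() && !pending.isEmpty()) {
            slotPool.accept(pending.poll());
        }
    }

    int pendingCount() {
        return pending.size();
    }
}
```

In this sketch an offer that arrives during a restart is neither accepted nor
rejected, so the TaskExecutor has no reason to free the slot and its local
state.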
> DefaultDeclarativeSlotPool rejects offered slots when the job is restarting
> ---------------------------------------------------------------------------
>
> Key: FLINK-25855
> URL: https://issues.apache.org/jira/browse/FLINK-25855
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Coordination
> Affects Versions: 1.15.0, 1.14.3
> Reporter: Till Rohrmann
> Priority: Major
>
> The {{DefaultDeclarativeSlotPool}} rejects offered slots if the job is
> currently restarting. The problem is that in case of a job restart, the
> scheduler sets the required resources to zero. Hence, all offered slots will
> be rejected.
> This is a problem for local recovery because rejected slots will be freed by
> the {{TaskExecutor}} and thereby all local state will be deleted. Hence, in
> order to properly support local recovery, we need to handle this situation
> somehow.
> This problem only affects the {{DefaultScheduler}} since the
> {{AdaptiveScheduler}} sets the required resources when transitioning into the
> {{WaitingForResources}} state.
> I see the following options:
> h4. Accept excess slots
> Accepting excess slots means that the {{DefaultDeclarativeSlotPool}} accepts
> slots which exceed the currently required set of slots.
> Advantages:
> * Easy to implement
> Disadvantages:
> * Offered slots that are not really needed will only be freed after the idle
> slot timeout. This means that some resources might be left unused for some
> time.
> h4. Let DefaultDeclarativeSlotPool accept excess slots only when job is
> restarting
> Here the idea is to only accept excess slots when the job is currently
> restarting. This requires the scheduler to tell the
> {{DefaultDeclarativeSlotPool}} about the restarting state.
> Advantages:
> * We would only accept excess slots for the time of restarting
> Disadvantages:
> * We are complicating the semantics of the {{DefaultDeclarativeSlotPool}}.
> Moreover, we are introducing additional signals that communicate the
> restarting state to the pool.
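
A rough sketch of this option, assuming a heavily simplified pool that only
counts slots. SketchSlotPool and its methods are hypothetical; the real
{{DefaultDeclarativeSlotPool}} matches offers against resource requirements
rather than a single counter.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified pool used only to illustrate the acceptance rule.
final class SketchSlotPool {
    private final Set<String> acceptedSlots = new HashSet<>();
    private int requiredSlots;
    private boolean restarting; // the additional signal sent by the scheduler

    void setRequiredSlots(int requiredSlots) {
        this.requiredSlots = requiredSlots;
    }

    void setRestarting(boolean restarting) {
        this.restarting = restarting;
    }

    // Returns true if the offered slot is accepted, false if it is rejected.
    boolean offerSlot(String slotId) {
        boolean needed = acceptedSlots.size() < requiredSlots;
        // Excess slots are accepted only while the job is restarting.
        if (needed || restarting) {
            acceptedSlots.add(slotId);
            return true;
        }
        return false;
    }
}
```

With zero required slots the offer is rejected unless the restarting flag is
set, which is the extra state the disadvantage above refers to.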
> h4. Don't immediately free slots on the TaskExecutor when they are rejected
> Instead of freeing the slot immediately on the {{TaskExecutor}} after it is
> rejected, we could retry for some time and only free the slot after some
> timeout.
> Advantages:
> * No changes on the JobMaster side needed.
> Disadvantages:
> * Complication of the slot lifecycle on the {{TaskExecutor}}
> * Unneeded slots are not made available for other jobs as fast as possible
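
A sketch of the delayed-free bookkeeping, assuming string slot ids and
timestamps passed in explicitly so the logic stays deterministic.
RejectedSlotTracker is a hypothetical name and does not correspond to an
existing {{TaskExecutor}} component.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical tracker: remembers when a slot was first rejected and reports
// it as expired only after a grace period.
final class RejectedSlotTracker {
    private final long freeTimeoutMillis;
    private final Map<String, Long> firstRejectedAt = new HashMap<>();

    RejectedSlotTracker(long freeTimeoutMillis) {
        this.freeTimeoutMillis = freeTimeoutMillis;
    }

    // Records a rejection; the earliest timestamp is kept so that repeated
    // rejections do not extend the grace period.
    void onRejected(String slotId, long nowMillis) {
        firstRejectedAt.putIfAbsent(slotId, nowMillis);
    }

    // A retried offer was accepted, so the slot and its local state stay.
    void onAccepted(String slotId) {
        firstRejectedAt.remove(slotId);
    }

    // Returns and forgets the slots whose grace period has elapsed; the
    // caller would free these slots.
    List<String> collectExpired(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = firstRejectedAt.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= freeTimeoutMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

The extra map and timeout are the lifecycle complication listed above: a slot
now has a "rejected but not yet freed" phase.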
> h4. Don't zero resource requirements during job restart
> Instead of zeroing the resource requirements during a job restart, we could
> also keep the last known requirements. Once the job is restarted, we could
> adjust the requirements.
> Advantages:
> * Conceptually easy to do
> Disadvantages:
> * The old requirements are not necessarily the new ones
> * Complicates the logic in the scheduler