[
https://issues.apache.org/jira/browse/FLINK-24005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406474#comment-17406474
]
Chesnay Schepler commented on FLINK-24005:
------------------------------------------
The analysis isn't quite correct.
The issue is that, for the DefaultScheduler, we do not correctly determine whether
or not a slot was fulfilling a requirement.
When the task fails, the slot is freed; the bridge decreases the requirements as
it should, and the pool holds on to the slot as it should.
However, the pool still has that slot in the
{{slotToRequirementProfileMappings}}.
When the TM providing said slot now fails, we look up which requirements the
slots from that TM have been fulfilling, so that we can pass that information to
the bridge, which then reduces the requirements accordingly.
Which requirement a slot was fulfilling is derived from the
{{slotToRequirementProfileMappings}}.
Because the slot is still in there, it is considered to still be fulfilling a
requirement; we tell the bridge so, and the bridge in turn reduces the
requirements, now for the second time.
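To make the double decrement concrete, here is a stand-alone toy sketch of that
flow (made-up names and plain maps/counters, not the actual
DeclarativeSlotPoolBridge/DefaultDeclarativeSlotPool code):

{code:java}
import java.util.HashMap;
import java.util.Map;

// Simplified illustration of the double decrement; all names are hypothetical.
public class DoubleDecrementSketch {

    public static void main(String[] args) {
        Map<String, String> slotToRequirementProfileMappings = new HashMap<>();
        int declaredRequirements = 3;

        // "slot-1" on tm-1 fulfils requirement profile "P".
        slotToRequirementProfileMappings.put("slot-1", "P");

        // 1) Task fails -> the DefaultScheduler frees the slot.
        //    The bridge decreases the requirements (3 -> 2); the pool keeps the slot
        //    AND keeps the entry in slotToRequirementProfileMappings.
        declaredRequirements -= 1;

        // 2) tm-1 is lost -> releaseSlots looks up which requirement the slot fulfilled.
        String previousRequirement = slotToRequirementProfileMappings.get("slot-1");
        if (previousRequirement != null) {
            // The bridge is told the slot was still fulfilling "P" and decrements again.
            declaredRequirements -= 1; // second decrement for the same slot: 2 -> 1
        }

        System.out.println("declared requirements: " + declaredRequirements); // 1 instead of 2
    }
}
{code}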
That the slot is still kept in the {{slotToRequirementProfileMappings}} is
generally fine. It is for example required by the AdaptiveScheduler because, in
contrast to the DefaultScheduler, it is able to free slots without reducing
requirements. But it is also used when a slot is freed to figure out which
requirement it was fulfilling.
Solution-wise, I don't think [~zhuzh]'s original suggestion of not decrementing
requirements for lost slots would solve the issue. It wouldn't behave correctly
when the slots are in use at that moment. It would remove the slot from the
backing pool and free the payload, but the call to {{freeReservedSlot}} would
then be a no-op because the slot was already removed; we'd end up never
reducing the requirements.
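A toy sketch of why that would go wrong for a reserved slot (again simplified,
hypothetical names, not Flink's API):

{code:java}
import java.util.HashMap;
import java.util.Map;

// Illustration of skipping the decrement in releaseSlots (the rejected idea).
public class SkipDecrementSketch {

    static final Map<String, String> reservedSlots = new HashMap<>(); // slotId -> requirement
    static int declaredRequirements = 3;

    // Lost-TM handling without decrementing requirements.
    static void releaseSlotsWithoutDecrement(String slotId) {
        reservedSlots.remove(slotId); // slot removed from the backing pool, payload failed
        // no requirement decrement here
    }

    // Later the scheduler reacts to the failed task and tries to free "its" slot.
    static void freeReservedSlot(String slotId) {
        String requirement = reservedSlots.remove(slotId);
        if (requirement == null) {
            return; // no-op: the slot is already gone, so nothing is decremented here either
        }
        declaredRequirements -= 1;
    }

    public static void main(String[] args) {
        reservedSlots.put("slot-1", "P");

        releaseSlotsWithoutDecrement("slot-1"); // TM lost while slot-1 is in use
        freeReservedSlot("slot-1");             // no-op

        // The requirement for the lost slot is never reduced.
        System.out.println("declared requirements: " + declaredRequirements); // still 3
    }
}
{code}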
I would also prefer not to add more book-keeping to the bridge; we have enough
as it is, and keeping it all in sync will just be annoying. It would be
particularly unfortunate because so far the bridge has never really had to deal
with resource requirements, and I don't think we should start now.
What I will try tomorrow is to make {{DefaultDeclarativeSlotPool#releaseSlots}}
a bit smarter such that it only determines the previous requirements for slots
that are currently reserved. That should also solve the issue.
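A rough, simplified sketch of that idea (toy names and plain maps, not the
actual Flink code):

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: only slots that are still reserved contribute to the "previously
// fulfilled requirements" that the bridge uses to decrement.
public class ReleaseSlotsSketch {

    static final Map<String, String> slotToRequirementProfileMappings = new HashMap<>();
    static final Map<String, Boolean> slotReserved = new HashMap<>(); // slotId -> reserved?
    static int declaredRequirements = 3;

    static void releaseSlots(Iterable<String> slotsOfLostTm) {
        for (String slotId : slotsOfLostTm) {
            boolean reserved = Boolean.TRUE.equals(slotReserved.remove(slotId));
            String previousRequirement = slotToRequirementProfileMappings.remove(slotId);

            // Only reserved slots were still fulfilling a requirement from the
            // DefaultScheduler's point of view; free slots were already accounted
            // for when they were released.
            if (reserved && previousRequirement != null) {
                declaredRequirements -= 1;
            }
        }
    }

    public static void main(String[] args) {
        // slot-1 was freed earlier (task failed); its requirement was already decremented.
        slotToRequirementProfileMappings.put("slot-1", "P");
        slotReserved.put("slot-1", false);
        declaredRequirements -= 1; // 3 -> 2, done when the task failed

        releaseSlots(List.of("slot-1")); // TM lost: no second decrement

        System.out.println("declared requirements: " + declaredRequirements); // 2
    }
}
{code}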
> Resource requirements declaration may be incorrect if JobMaster disconnects
> with a TaskManager with available slots in the SlotPool
> -----------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-24005
> URL: https://issues.apache.org/jira/browse/FLINK-24005
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.14.0, 1.12.5, 1.13.2
> Reporter: Zhu Zhu
> Assignee: Chesnay Schepler
> Priority: Critical
> Fix For: 1.14.0, 1.12.6, 1.13.3
>
> Attachments: decrease_resource_requirements.log
>
>
> When a TaskManager disconnects from the JobMaster, this triggers
> `DeclarativeSlotPoolService#decreaseResourceRequirementsBy()` for all the
> slots that were registered to the JobMaster from that TaskManager. If the
> slots are still available, i.e. not assigned to any task, the
> `decreaseResourceRequirementsBy` call may lead to an incorrect resource
> requirements declaration.
> For example, consider a job with only 3 source tasks. It requires 3 slots
> and declares 3 slots. Initially all the tasks are running. Suddenly one
> task fails and waits for some delay before restarting. The slot it was using
> is returned to the SlotPool. Now the job requires 2 slots and declares 2
> slots. At this moment, the TaskManager of that returned slot gets lost. After
> the triggered `decreaseResourceRequirementsBy`, the job only declares 1
> slot. Finally, when the failed task is re-scheduled, the job will declare
> 2 slots while it actually needs 3 slots.
> The attached log of a real job and logs of the added test in
> https://github.com/zhuzhurk/flink/commit/59ca0ac5fa9c77b97c6e8a43dcc53ca8a0ad6c37
> can demonstrate this case.
> Note that the real job is configured with a large
> "restart-strategy.fixed-delay.delay" and a large "slot.idle.timeout", so
> this is probably a rare case in production.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)