[
https://issues.apache.org/jira/browse/FLINK-18293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213190#comment-17213190
]
Till Rohrmann commented on FLINK-18293:
---------------------------------------
I think a {{FAILED}} task state does not mean that the {{Task}} has been
removed from the {{TaskSlot}}. I think the {{Task}} must first report its
final state before it is removed from the {{TaskSlot}}. If you search the
logs for "Un-registering task and sending final execution state {} to
JobManager for task {} {}." then you can see when the task is actually being
removed from the {{TaskSlot}}.
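
To illustrate the distinction, here is a minimal sketch. It is not the actual
{{TaskSlot}} code; the class {{SlotOccupancySketch}}, its enum and all method
names are hypothetical. It only shows that marking a task as failed and
removing it from the slot are two separate steps, and that the slot is empty
only after the second one.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of a slot's task bookkeeping; not Flink's TaskSlot. */
final class SlotOccupancySketch {

    /** Simplified, assumed task states for this sketch only. */
    enum TaskState { RUNNING, FAILED }

    private final Map<String, TaskState> tasks = new HashMap<>();

    /** Deploys a task into the slot. */
    void addTask(String taskId) {
        tasks.put(taskId, TaskState.RUNNING);
    }

    /** Changes the task's state only; the task stays in the slot. */
    void markFailed(String taskId) {
        if (tasks.replace(taskId, TaskState.FAILED) == null) {
            throw new IllegalArgumentException("unknown task " + taskId);
        }
    }

    /** Models the un-registration step: only now does the task leave the slot. */
    void unregisterTask(String taskId) {
        tasks.remove(taskId);
    }

    /** The slot counts as empty only when no task is registered in it. */
    boolean isEmpty() {
        return tasks.isEmpty();
    }
}
```

In this model, a slot whose only task has been marked failed still reports
{{isEmpty() == false}} until {{unregisterTask}} has been called.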
> TaskExecutor offering non empty slots can lead to resource violation
> --------------------------------------------------------------------
>
> Key: FLINK-18293
> URL: https://issues.apache.org/jira/browse/FLINK-18293
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.10.1, 1.11.0
> Reporter: Till Rohrmann
> Priority: Major
> Fix For: 1.12.0
>
>
> When a {{JobMaster}} loses leadership, the {{TaskExecutor}} will fail
> all running tasks belonging to this job and transition all slots belonging to
> this job from {{ACTIVE}} into {{ALLOCATED}}. The idea is that these slots can
> be re-offered to the new leader of the very same job.
> A problem arises when the {{Task}} cancellation takes longer than the
> election of the new leader. In this case, the slot containing a
> {{CANCELLING}} task will be offered to the new {{JobMaster}} as empty. The
> {{JobMaster}}, not knowing that the slot still contains a resource consumer,
> might deploy new tasks into it, believing that these tasks can use all of
> the available resources. In the best case, the newly deployed {{Tasks}} will
> simply get fewer resources than expected. In the worst case, this will lead
> to a resource violation.
> Without the {{JobMaster}} being able to reconcile the state of {{Tasks}}
> already deployed into {{Slots}}, I believe that we should only re-offer the
> slot when it is free. One might model this scenario by introducing a new
> {{TaskSlotState.CLEANING}}. {{CLEANING}} means that the slot is still
> allocated for a given job but that there are still some resources which need
> to be cleaned up before it can be re-offered (transition to state
> {{ALLOCATED}}).
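
The proposed {{CLEANING}} state could be sketched roughly as follows. This is
an illustration under assumptions, not the implemented fix: the class
{{SlotSketch}}, the enum {{SlotState}} and the methods {{onLeadershipLost}},
{{onTaskRemoved}} and {{canBeOffered}} are made-up names. The slot leaves
{{ACTIVE}} when the leader is lost, stays in {{CLEANING}} while tasks are
still registered, and only becomes {{ALLOCATED}} (and thus re-offerable) once
the last task has been removed.

```java
/** Hypothetical slot state machine with a CLEANING state; not Flink's code. */
final class SlotSketch {

    /** Assumed states: the two from the description plus the proposed CLEANING. */
    enum SlotState { ACTIVE, CLEANING, ALLOCATED }

    private SlotState state = SlotState.ACTIVE;
    private int remainingTasks;

    SlotSketch(int runningTasks) {
        this.remainingTasks = runningTasks;
    }

    /** The JobMaster lost leadership: tasks are failed but may still hold resources. */
    void onLeadershipLost() {
        state = (remainingTasks == 0) ? SlotState.ALLOCATED : SlotState.CLEANING;
    }

    /** Called once a task has actually been removed from the slot. */
    void onTaskRemoved() {
        if (remainingTasks == 0) {
            throw new IllegalStateException("no task left to remove");
        }
        remainingTasks--;
        // The slot is free again only after the last task is gone.
        if (remainingTasks == 0 && state == SlotState.CLEANING) {
            state = SlotState.ALLOCATED;
        }
    }

    /** Only a clean, allocated slot may be re-offered to the new leader. */
    boolean canBeOffered() {
        return state == SlotState.ALLOCATED;
    }

    SlotState getState() {
        return state;
    }
}
```

With this model, a slot that still contains a task being cancelled is never
offered as empty, because {{canBeOffered()}} stays false until
{{onTaskRemoved}} has been called for every task.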