[jira] [Updated] (FLINK-18293) TaskExecutor offering non empty slots can lead to resource violation

Flink Jira Bot (Jira) Sat, 20 Nov 2021 02:40:10 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-18293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Flink Jira Bot updated FLINK-18293:
-----------------------------------
    Labels: auto-deprioritized-major stale-minor  (was: 
auto-deprioritized-major)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help 
the community manage its development. I see this issue has been marked as 
Minor but is unassigned and neither itself nor its Sub-Tasks have been updated 
for 180 days. I have gone ahead and marked it "stale-minor". If this ticket is 
still Minor, please either assign yourself or give an update. Afterwards, 
please remove the label or in 7 days the issue will be deprioritized.


> TaskExecutor offering non empty slots can lead to resource violation
> --------------------------------------------------------------------
>
>                 Key: FLINK-18293
>                 URL: https://issues.apache.org/jira/browse/FLINK-18293
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.1, 1.11.0
>            Reporter: Till Rohrmann
>            Priority: Minor
>              Labels: auto-deprioritized-major, stale-minor
>
> When a {{JobMaster}} loses leadership, then the {{TaskExecutor}} will fail 
> all running tasks belonging to this job and transition all slots belonging to 
> this job from {{ACTIVE}} into {{ALLOCATED}}. The idea is that these slots can 
> be re-offered to the new leader of the very same job.
> A problem arises when the {{Task}} cancellation takes longer than the 
> election of the new leader. In this case, the slot containing a 
> {{CANCELLING}} task will be offered to the new {{JobMaster}} as empty. The 
> {{JobMaster}}, not knowing that the slot still contains a resource consumer, 
> might deploy new tasks into it, believing that these tasks can use all of the 
> available resources. In the best case, the newly deployed {{Tasks}} will 
> simply get fewer resources than expected. In the worst case, this will lead 
> to a resource violation.
> Without the {{JobMaster}} being able to reconcile the state of already 
> deployed {{Tasks}} into {{Slots}}, I believe that we should only re-offer the 
> slot when it is free. One might model this scenario by introducing a new 
> state, {{TaskSlotState.CLEANING}}. {{CLEANING}} means that the slot is still 
> allocated for a given job but that there are still some resources which need 
> to be cleaned up before it can be re-offered (transition to state 
> {{ALLOCATED}}).
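
The proposal quoted above could be sketched roughly as follows. This is a
minimal illustration, not Flink's actual TaskSlot or TaskSlotState code: the
names SketchSlotState, SketchSlot, onJobLeaderLost, onTaskTerminated and
canBeOffered are all hypothetical, and only the three states mentioned in the
issue are modeled. The assumption is that a slot whose tasks are still being
cancelled enters CLEANING and only moves to ALLOCATED (and so becomes
re-offerable) once its last task has terminated.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical states; CLEANING is the addition proposed in the issue. */
enum SketchSlotState { ALLOCATED, ACTIVE, CLEANING }

/** Simplified stand-in for a task slot; NOT Flink's real TaskSlot class. */
final class SketchSlot {
    private SketchSlotState state = SketchSlotState.ALLOCATED;
    private final Set<String> tasks = new HashSet<>();

    SketchSlotState getState() {
        return state;
    }

    /** The job's leader accepted the slot offer. */
    void activate() {
        if (state != SketchSlotState.ALLOCATED) {
            throw new IllegalStateException("Cannot activate slot in state " + state);
        }
        state = SketchSlotState.ACTIVE;
    }

    /** A task was deployed into the active slot. */
    void addTask(String taskId) {
        if (state != SketchSlotState.ACTIVE) {
            throw new IllegalStateException("Cannot deploy into slot in state " + state);
        }
        tasks.add(taskId);
    }

    /** The JobMaster lost leadership; task cancellation is assumed to start elsewhere. */
    void onJobLeaderLost() {
        if (state == SketchSlotState.ACTIVE) {
            // Tasks may still be CANCELLING, so the slot is not yet empty.
            state = tasks.isEmpty() ? SketchSlotState.ALLOCATED : SketchSlotState.CLEANING;
        }
    }

    /** A task reached a terminal state and released its resources. */
    void onTaskTerminated(String taskId) {
        tasks.remove(taskId);
        if (state == SketchSlotState.CLEANING && tasks.isEmpty()) {
            state = SketchSlotState.ALLOCATED;
        }
    }

    /** Only slots without leftover resource consumers are offered to the new leader. */
    boolean canBeOffered() {
        return state == SketchSlotState.ALLOCATED;
    }
}

final class CleaningSlotSketch {
    public static void main(String[] args) {
        SketchSlot slot = new SketchSlot();
        slot.activate();
        slot.addTask("task-1");

        slot.onJobLeaderLost();
        // prints "CLEANING offerable=false"
        System.out.println(slot.getState() + " offerable=" + slot.canBeOffered());

        slot.onTaskTerminated("task-1");
        // prints "ALLOCATED offerable=true"
        System.out.println(slot.getState() + " offerable=" + slot.canBeOffered());
    }
}
```

With this shape, a new leader elected while cancellation is still in progress
would simply not see the slot in its offer until the cleanup has finished.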



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
