[jira] [Commented] (FLINK-6325) Refinement of slot reuse for task manager failure

zhijiang (JIRA) Wed, 19 Apr 2017 02:54:07 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974401#comment-15974401
 ]


zhijiang commented on FLINK-6325:
---------------------------------

[~StephanEwen], for the concrete implementation, we would like to add a flag to 
{{AllocateSlot}} in {{SlotPool}} that indicates whether a new slot must be 
requested. 
In the {{TaskManager}} failure scenario, the flag will be true, and the 
{{SlotPool}} will request a new slot from the {{ResourceManager}} directly. 
In the task failure scenario, the flag will be false, and the {{SlotPool}} will 
first try to match the previous slot by {{TaskManagerLocation}} in 
{{AvailableSlots}}, which is the same as the current behavior.

Do you think this approach makes sense, or do you have other suggestions? 
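
To make the proposal easier to discuss, here is a rough, self-contained sketch 
of the intended allocation logic. It is a toy model and not Flink's actual 
{{SlotPool}} code: the class {{SlotPoolSketch}}, the {{Slot}} type, the string 
identifiers, and the method names are all hypothetical. It also assumes that 
the pool asks the {{ResourceManager}} for a new slot when nothing matches.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustrative model only; the names here are hypothetical, not Flink's API. */
public class SlotPoolSketch {

    /** A slot, identified by the TaskManager that hosts it and its resource profile. */
    static final class Slot {
        final String taskManager;
        final String profile;

        Slot(String taskManager, String profile) {
            this.taskManager = taskManager;
            this.profile = profile;
        }
    }

    // Stand-in for AvailableSlots: slots currently free in the pool.
    private final List<Slot> availableSlots = new ArrayList<>();
    private int newSlotCounter = 0;

    void addAvailableSlot(String taskManager, String profile) {
        availableSlots.add(new Slot(taskManager, profile));
    }

    int availableCount() {
        return availableSlots.size();
    }

    /**
     * requestNewSlot == true  (TaskManager failure): skip matching, ask for a new slot.
     * requestNewSlot == false (task failure): current mode, prefer the previous location,
     * then fall back to any available slot with a matching profile.
     */
    Slot allocateSlot(String previousLocation, String profile, boolean requestNewSlot) {
        if (requestNewSlot) {
            return requestFromResourceManager(profile);
        }
        Slot match = pollMatching(previousLocation, profile);
        if (match == null) {
            match = pollMatching(null, profile);
        }
        // Assumption: with no available match, a new slot is requested.
        return match != null ? match : requestFromResourceManager(profile);
    }

    /** Removes and returns the first available slot matching location (if given) and profile. */
    private Slot pollMatching(String location, String profile) {
        Iterator<Slot> it = availableSlots.iterator();
        while (it.hasNext()) {
            Slot slot = it.next();
            boolean locationOk = location == null || slot.taskManager.equals(location);
            if (locationOk && slot.profile.equals(profile)) {
                it.remove();
                return slot;
            }
        }
        return null;
    }

    /** Stand-in for a ResourceManager request; fabricates a slot on a new TaskManager. */
    private Slot requestFromResourceManager(String profile) {
        newSlotCounter++;
        return new Slot("new-tm-" + newSlotCounter, profile);
    }

    public static void main(String[] args) {
        SlotPoolSketch pool = new SlotPoolSketch();
        pool.addAvailableSlot("tm-2", "default");

        // tm-1 has failed: with the flag set, the free slot on tm-2 is left alone.
        Slot fresh = pool.allocateSlot("tm-1", "default", true);
        System.out.println(fresh.taskManager + ", available=" + pool.availableCount());

        // A task that ran on tm-2 failed: without the flag, its previous slot is reused.
        Slot reused = pool.allocateSlot("tm-2", "default", false);
        System.out.println(reused.taskManager + ", available=" + pool.availableCount());
    }
}
```

With the flag set, the execution from the failed {{TaskManager}} never touches 
the free slot on the other TaskManager, so the execution that previously ran 
there can still get its slot back.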

> Refinement of slot reuse for task manager failure
> -------------------------------------------------
>
>                 Key: FLINK-6325
>                 URL: https://issues.apache.org/jira/browse/FLINK-6325
>             Project: Flink
>          Issue Type: Improvement
>          Components: JobManager
>            Reporter: zhijiang
>            Assignee: zhijiang
>            Priority: Minor
>
> After a task or TaskManager failure, the new execution attempt by default 
> tries to take the slot of the prior execution. This benefits state recovery 
> locality with the RocksDB backend, and it makes sense for the task failure 
> scenario.
> In the TaskManager failure scenario, however, the slots inside the failed 
> TaskManager are released and cannot be reused. When an execution from that 
> TaskManager is reset and allocates a slot from {{SlotPool}}, no slot can be 
> matched by {{ResourceID}}, so the pool tries to match any other available 
> slot by {{ResourceProfile}}. As a result, a slot that belonged to another 
> parallel execution is occupied by the execution from the failed 
> {{TaskManager}}, and the following executions may in turn be unable to reuse 
> their previous slots. This hurts state recovery locality.
> To solve this problem, we would like to request a new slot for re-deployment 
> when the execution is attached to an unavailable location, so that it does 
> not occupy slots on TaskManagers that are still alive.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
