[jira] [Updated] (FLINK-6325) Refinement of slot reuse for task manager failure

zhijiang (JIRA) Wed, 19 Apr 2017 01:29:32 -0700

     [ https://issues.apache.org/jira/browse/FLINK-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


zhijiang updated FLINK-6325:
----------------------------
    Description: 
After a task or TaskManager failure, the new execution attempt tries by
default to take the slot used by the prior execution. This benefits state
recovery locality with the RocksDB backend, and it makes sense for the task
failure scenario.

In the TaskManager failure scenario, however, the slots inside the failed
TaskManager are recycled and can no longer be reused. When an execution that
ran there is reset and allocates a slot from {{SlotPool}}, no slot can be
matched by {{ResourceID}}, so it tries to match any other available slot by
{{ResourceProfile}}. As a result, the execution from the failed
{{TaskManager}} occupies a slot that belonged to another parallel execution,
and all the following executions may be unable to reuse their previous slots.
This hurts state recovery.

To solve this problem, we would like to request a new slot for re-deployment
when the execution is attached to an unavailable location, so that it no
longer occupies slots that are still alive.
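
To make the difference concrete, here is a minimal sketch of the two
allocation behaviours. It is illustrative only: the class, the {{Slot}} type,
the {{allocate}}/{{take}}/{{restart}} helpers, the {{refined}} flag and the
string identifiers are all hypothetical and are not Flink's actual
{{SlotPool}} API. A returned null stands for "request a new slot".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Illustrative only: these types are hypothetical and are not Flink's SlotPool API. */
public class SlotReuseSketch {

    /** A free slot, described by its hosting TaskManager and its resource profile. */
    static final class Slot {
        final String taskManagerId;
        final String profile;

        Slot(String taskManagerId, String profile) {
            this.taskManagerId = taskManagerId;
            this.profile = profile;
        }
    }

    /**
     * Picks a slot for a restarted execution and removes it from the pool.
     * Returns null when the caller should request a brand-new slot instead.
     */
    static Slot allocate(List<Slot> pool, Set<String> aliveTaskManagers,
                         String priorTaskManagerId, String profile, boolean refined) {
        // Step 1: prefer a slot on the TaskManager used by the prior attempt.
        Slot local = take(pool, priorTaskManagerId, profile);
        if (local != null) {
            return local;
        }
        // Step 2 (proposed refinement): the prior location is gone, so leave
        // the surviving slots to the executions that previously ran there.
        if (refined && !aliveTaskManagers.contains(priorTaskManagerId)) {
            return null;
        }
        // Step 3 (current default): fall back to any slot with a matching profile.
        return take(pool, null, profile);
    }

    /** Removes and returns the first matching slot; a null taskManagerId matches any host. */
    private static Slot take(List<Slot> pool, String taskManagerId, String profile) {
        for (Iterator<Slot> it = pool.iterator(); it.hasNext();) {
            Slot slot = it.next();
            boolean hostMatches = taskManagerId == null || slot.taskManagerId.equals(taskManagerId);
            if (hostMatches && slot.profile.equals(profile)) {
                it.remove();
                return slot;
            }
        }
        return null;
    }

    /** Restarts execution A (prior host tm-1, now failed) and then B (prior host tm-2). */
    static String[] restart(boolean refined) {
        List<Slot> pool = new ArrayList<>(Arrays.asList(new Slot("tm-2", "default")));
        Set<String> alive = new HashSet<>(Arrays.asList("tm-2"));
        Slot a = allocate(pool, alive, "tm-1", "default", refined);
        Slot b = allocate(pool, alive, "tm-2", "default", refined);
        return new String[] {describe(a), describe(b)};
    }

    private static String describe(Slot slot) {
        return slot == null ? "new slot requested" : slot.taskManagerId;
    }

    public static void main(String[] args) {
        System.out.println("default: " + Arrays.toString(restart(false)));
        System.out.println("refined: " + Arrays.toString(restart(true)));
    }
}
```

With the default behaviour, execution A takes the slot on tm-2 and B loses
its previous location. With the refinement, A requests a new slot and B gets
its slot on tm-2 back, which is the locality this issue is trying to keep.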

  was:
After a task or TaskManager failure, the new execution attempt tries by
default to take the slot used by the prior execution. This benefits state
recovery locality with the RocksDB backend, and it makes sense for the task
failure scenario.
In the TaskManager failure scenario, however, the slots inside the failed
TaskManager are recycled and can no longer be reused. When an execution that
ran there is reset and allocates a slot from {{SlotPool}}, no slot can be
matched by {{ResourceID}}, so it tries to match any other available slot by
{{ResourceProfile}}. As a result, the execution from the failed
{{TaskManager}} occupies a slot that belonged to another parallel execution,
and all the following executions may be unable to reuse their previous slots.
This hurts state recovery.
To solve this problem, we would like to request a new slot for re-deployment
when the execution is attached to an unavailable location, so that it no
longer occupies slots that are still alive.


> Refinement of slot reuse for task manager failure
> -------------------------------------------------
>
>                 Key: FLINK-6325
>                 URL: https://issues.apache.org/jira/browse/FLINK-6325
>             Project: Flink
>          Issue Type: Improvement
>          Components: JobManager
>            Reporter: zhijiang
>            Priority: Minor
>
> After a task or TaskManager failure, the new execution attempt tries by
> default to take the slot used by the prior execution. This benefits state
> recovery locality with the RocksDB backend, and it makes sense for the task
> failure scenario.
> In the TaskManager failure scenario, however, the slots inside the failed
> TaskManager are recycled and can no longer be reused. When an execution that
> ran there is reset and allocates a slot from {{SlotPool}}, no slot can be
> matched by {{ResourceID}}, so it tries to match any other available slot by
> {{ResourceProfile}}. As a result, the execution from the failed
> {{TaskManager}} occupies a slot that belonged to another parallel execution,
> and all the following executions may be unable to reuse their previous
> slots. This hurts state recovery.
> To solve this problem, we would like to request a new slot for re-deployment
> when the execution is attached to an unavailable location, so that it no
> longer occupies slots that are still alive.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
