Re: Behaviour when a task in a running job fails

Luis Alves Wed, 09 Aug 2017 15:48:29 -0700

Hi Till,

Awesome! Thanks a lot for your answer!


Currently I don’t have much free time to help with the 
implementation/contribute to Flink, but for sure I will in a near future!

Thanks again, 

Luís Alves

On Wed, 09 Aug 2017 at 10:29 Till Rohrmann

<
mailto:Till Rohrmann <trohrm...@apache.org>
> wrote:

a, pre, code, a:link, body { word-wrap: break-word !important; }

Hi Luis,

Flink's default behaviour is to fail all tasks and then redeploying the

tasks with a possible different task to slot mapping.

However, we recently introduced the infrastructure for a more fine-grained

recovery mechanism. With this infrastructure it will be possible to only

restart depended tasks which are effectively the connected components wrt

the job DAG. See FLINK-5869 for more information.

Moreover, we are working on rescheduling tasks to slots which executed the

task before (introducing a scheduling preference). That way we can benefit

from state which is still kept on these machines instead of having to

download it.

So basically, what you are writing is correct but also not as easy as to

implement. Some of the required functionality is already there and more

will follow with the upcoming 1.4 release. if you are interested in helping

out with the implementation then feel welcome!

Cheers,

Till

On Wed, Aug 9, 2017 at 12:45 AM, Luis Alves <
mailto:lmtjal...@gmail.com
> wrote:

> Hello,

>

> What happens when a task in a running job fails? Will all the current

> executions of the job's tasks fail? Will all the slots being used by the

> job tasks (failed and non-failed ones) be released.

>

> Assuming all the slots are released, wouldn’t it make sense to:

>

> 1. “stop” the non-failed tasks and keep them holding their slots. Where

> “stop” may mean something like stop processing tuples from the input

> streams.

>

> 2. Re-schedule the failed task to a new slot (imagine the task failed

> because the Task Manager owning that slot failed)

>

> 3. Recover state to the last snapshot.

>

> 4. “re-start”. Where “re-start” means start processing tuples from the

> input streams.

>

> Thanks,

>

> Luís Alves

Re: Behaviour when a task in a running job fails

Reply via email to