Flink's default behaviour is to fail all tasks and then redeploying the
tasks with a possible different task to slot mapping.
However, we recently introduced the infrastructure for a more fine-grained
recovery mechanism. With this infrastructure it will be possible to only
restart depended tasks which are effectively the connected components wrt
the job DAG. See FLINK-5869 for more information.
Moreover, we are working on rescheduling tasks to slots which executed the
task before (introducing a scheduling preference). That way we can benefit
from state which is still kept on these machines instead of having to
So basically, what you are writing is correct but also not as easy as to
implement. Some of the required functionality is already there and more
will follow with the upcoming 1.4 release. if you are interested in helping
out with the implementation then feel welcome!
On Wed, Aug 9, 2017 at 12:45 AM, Luis Alves <lmtjal...@gmail.com> wrote:
> What happens when a task in a running job fails? Will all the current
> executions of the job's tasks fail? Will all the slots being used by the
> job tasks (failed and non-failed ones) be released.
> Assuming all the slots are released, wouldn’t it make sense to:
> 1. “stop” the non-failed tasks and keep them holding their slots. Where
> “stop” may mean something like stop processing tuples from the input
> 2. Re-schedule the failed task to a new slot (imagine the task failed
> because the Task Manager owning that slot failed)
> 3. Recover state to the last snapshot.
> 4. “re-start”. Where “re-start” means start processing tuples from the
> input streams.
> Luís Alves