What happens when a task in a running job fails? Will all the current
executions of the job's tasks fail? Will all the slots being used by the job
tasks (failed and non-failed ones) be released.
Assuming all the slots are released, wouldn’t it make sense to:
1. “stop” the non-failed tasks and keep them holding their slots. Where “stop”
may mean something like stop processing tuples from the input streams.
2. Re-schedule the failed task to a new slot (imagine the task failed because
the Task Manager owning that slot failed)
3. Recover state to the last snapshot.
4. “re-start”. Where “re-start” means start processing tuples from the input