Elias Levy created FLINK-7646:
---------------------------------

             Summary: Restart failed jobs with configurable parallelism range
                 Key: FLINK-7646
                 URL: https://issues.apache.org/jira/browse/FLINK-7646
             Project: Flink
          Issue Type: Improvement
          Components: DataStream API
    Affects Versions: 1.3.2
            Reporter: Elias Levy


Currently, if a TaskManager fails, the whole job is terminated and then, 
depending on the restart strategy, a restart may be attempted.  If the 
failed TaskManager has not been replaced, and there are no spare task slots in 
the cluster, the restart will fail.
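For reference, the restart behavior described above is governed by the restart strategy, which can be set cluster-wide in flink-conf.yaml (these keys exist in Flink 1.3):

```
# Retry the job a fixed number of times, waiting between attempts.
# If no task slots are available by the last attempt, the job fails permanently.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

Note that none of the existing strategies can lower the job's parallelism; they only retry at the parallelism the job was submitted with, which is the gap this issue is about.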

There are situations where restoring or adding a new TaskManager may take a 
while.  For instance, in AWS an Auto Scaling Group can only manage a group of 
instances in a single availability zone.  If you have a cluster of 
TaskManagers that spans multiple AZs, managed by one ASG per AZ, and an AZ goes 
dark, the other ASGs won't scale automatically to make up for the lost 
TaskManagers.  To resolve the situation, the healthy ASGs must be modified 
manually or by systems external to AWS.

With that in mind, it would be useful if you could specify a range for the 
parallelism parameter.  Under normal circumstances the job would execute with 
the maximum parallelism of the range.  But if TaskManagers were lost and not 
replaced after some time, the job would accept being executed with some lower 
parallelism within the range.
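As a sketch of what this could look like, a parallelism range might be expressed alongside the restart strategy.  To be clear, none of the following keys exist in Flink; they are hypothetical names that only illustrate the proposed behavior:

```
# HYPOTHETICAL configuration illustrating this proposal -- not an existing Flink API.
parallelism.range.max: 32           # preferred parallelism under normal conditions
parallelism.range.min: 16           # lowest acceptable parallelism for a degraded restart
parallelism.range.degrade-after: 10 min  # how long to wait for slots before lowering parallelism
```

On restart, the scheduler would first try to acquire enough slots for the maximum parallelism, and after the configured wait would fall back to any parallelism within the range that the remaining slots can satisfy.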

I understand that this may not be feasible with checkpoints, as savepoints are 
supposed to be the mechanism used to change the parallelism of a stateful job.  
Therefore, this proposal may need to wait until the implementation of the 
periodic savepoint feature (FLINK-4511).

This feature would aid the availability of Flink jobs.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
