[
https://issues.apache.org/jira/browse/FLINK-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-7646:
----------------------------------
Labels: auto-deprioritized-major stale-minor (was:
auto-deprioritized-major)
I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help
the community manage its development. I see this issue has been marked as
Minor but is unassigned and neither itself nor its Sub-Tasks have been updated
for 180 days. I have gone ahead and marked it "stale-minor". If this ticket is
still Minor, please either assign yourself or give an update. Afterwards,
please remove the label or in 7 days the issue will be deprioritized.
> Restart failed jobs with configurable parallelism range
> -------------------------------------------------------
>
> Key: FLINK-7646
> URL: https://issues.apache.org/jira/browse/FLINK-7646
> Project: Flink
> Issue Type: Improvement
> Components: API / DataStream
> Affects Versions: 1.3.2
> Reporter: Elias Levy
> Priority: Minor
> Labels: auto-deprioritized-major, stale-minor
>
> Currently, if a TaskManager fails the whole job is terminated and then,
> depending on the restart policy, a restart may be attempted. If the
> failed TaskManager has not been replaced, and there are no spare task slots
> in the cluster, the job will fail to restart.
> There are situations where restoring or adding a new TaskManager may take a
> while. For instance, in AWS an Auto Scaling Group can only be used to manage
> a group of instances in a single availability zone. If you have a cluster of
> TaskManagers that spans multiple AZs, managed by one ASG per AZ, and an AZ goes
> dark, the other ASGs won't scale automatically to make up for the lost
> TaskManagers. To resolve the situation the healthy ASGs will need to be
> modified manually or by systems external to AWS.
> With that in mind, it would be useful if you could specify a range for the
> parallelism parameter. Under normal circumstances the job would execute with
> the maximum parallelism of the range. But if TaskManagers were lost and not
> replaced after some time, the job would accept being executed with some lower
> parallelism within the range.
> I understand that this may not be feasible with checkpoints, as savepoints
> are supposed to be the mechanism used to change parallelism of a stateful
> job. Therefore, this proposal may need to wait until the implementation of
> the periodic savepoint feature (FLINK-4511).
> This feature would aid the availability of Flink jobs.
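
For illustration only, the restart rule proposed in the description could be sketched as below. This is a minimal sketch and not Flink code: ParallelismRange, choose_restart_parallelism and grace_period_seconds are hypothetical names that do not appear in the issue, and the sketch assumes that one free task slot is needed per unit of parallelism.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParallelismRange:
    """Hypothetical setting: acceptable parallelism bounds for a job."""
    minimum: int
    maximum: int

    def __post_init__(self):
        if not 1 <= self.minimum <= self.maximum:
            raise ValueError("require 1 <= minimum <= maximum")


def choose_restart_parallelism(
    allowed: ParallelismRange,
    available_slots: int,
    seconds_since_failure: float,
    grace_period_seconds: float,
) -> Optional[int]:
    """Return the parallelism to restart with, or None to keep waiting.

    Assumption: one free task slot is required per unit of parallelism.
    """
    # Enough slots for full parallelism: restart at the top of the range.
    if available_slots >= allowed.maximum:
        return allowed.maximum
    # Give lost TaskManagers a chance to be replaced before degrading.
    if seconds_since_failure < grace_period_seconds:
        return None
    # Grace period elapsed: accept the largest parallelism that still fits.
    if available_slots >= allowed.minimum:
        return available_slots
    # Fewer slots than the lower bound: the job cannot be restarted yet.
    return None
```

With a range of 4 to 12 and a grace period of 300 seconds, a cluster left with 8 free slots would keep waiting during the grace period and would restart at parallelism 8 once it has elapsed. As the description notes, restoring state at a different parallelism is a separate concern that this sketch does not address.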