Stephan Ewen created FLINK-6666:
-----------------------------------

             Summary: RestartStrategy should differentiate between types of recovery (global / local / resource missing)
                 Key: FLINK-6666
                 URL: https://issues.apache.org/jira/browse/FLINK-6666
             Project: Flink
          Issue Type: Sub-task
          Components: Distributed Coordination
    Affects Versions: 1.3.0
            Reporter: Stephan Ewen

Currently, the {{RestartStrategy}} has a single method that is called when a failure requires an ExecutionGraph restart. With the new addition of incremental recovery, it is desirable to distinguish between the types of failover that can happen.

I suggest extending the {{RestartStrategy}} to support three cases/methods:

  - {{restartGlobal()}} for a full restart recovery
  - {{restartLocal()}} for a recovery coordinated by the {{FailoverStrategy}}
  - {{restartOnMissingResources()}} if the failure cause was missing slots

The last case is interesting, in my opinion, because it is commonly desirable that regular failover has no delay, while failover on missing resources has a short delay (1s or so) to avoid very fast cycles of restart attempts. In standalone mode, there can easily be 100,000 restarts within a second when no resources are available and restarts happen without delay.
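
To make the proposal a bit more concrete, here is a rough sketch of what the three methods could look like. Only the method names {{restartGlobal()}}, {{restartLocal()}} and {{restartOnMissingResources()}} are taken from the description above. The parameters, the {{DelayedExecutor}} helper, and the names {{RecoveryAwareRestartStrategy}} and {{DelayOnMissingResourcesStrategy}} are assumptions made for illustration and are not existing Flink classes.

```java
// Sketch only: every type here is hypothetical and not part of Flink's API.
// Only the three method names come from the proposal in this issue.

/** Hypothetical stand-in for the executor a restart would be scheduled on. */
interface DelayedExecutor {
    void schedule(Runnable action, long delayMillis);
}

/** Hypothetical shape of the extended strategy: one method per kind of recovery. */
interface RecoveryAwareRestartStrategy {
    void restartGlobal(Runnable restartAction, DelayedExecutor executor);
    void restartLocal(Runnable restartAction, DelayedExecutor executor);
    void restartOnMissingResources(Runnable restartAction, DelayedExecutor executor);
}

/** Regular failover restarts immediately; missing slots trigger a delayed restart. */
final class DelayOnMissingResourcesStrategy implements RecoveryAwareRestartStrategy {
    private final long missingResourcesDelayMillis;

    DelayOnMissingResourcesStrategy(long missingResourcesDelayMillis) {
        if (missingResourcesDelayMillis < 0) {
            throw new IllegalArgumentException("delay must not be negative");
        }
        this.missingResourcesDelayMillis = missingResourcesDelayMillis;
    }

    @Override
    public void restartGlobal(Runnable restartAction, DelayedExecutor executor) {
        executor.schedule(restartAction, 0L); // full restart, no delay
    }

    @Override
    public void restartLocal(Runnable restartAction, DelayedExecutor executor) {
        executor.schedule(restartAction, 0L); // local recovery, no delay
    }

    @Override
    public void restartOnMissingResources(Runnable restartAction, DelayedExecutor executor) {
        // Back off briefly so that missing slots do not cause a tight restart loop.
        executor.schedule(restartAction, missingResourcesDelayMillis);
    }
}

final class RestartStrategySketch {
    public static void main(String[] args) {
        // Synchronous executor that only reports the requested delay.
        DelayedExecutor printing = (action, delayMillis) -> {
            System.out.println("restart scheduled with delay " + delayMillis + " ms");
            action.run();
        };
        RecoveryAwareRestartStrategy strategy = new DelayOnMissingResourcesStrategy(1000L);
        strategy.restartGlobal(() -> { }, printing);
        strategy.restartLocal(() -> { }, printing);
        strategy.restartOnMissingResources(() -> { }, printing);
    }
}
```

In a real implementation the {{DelayedExecutor}} would presumably be backed by something like a {{java.util.concurrent.ScheduledExecutorService}}. Keeping the delay decision inside the strategy means the caller only has to report which kind of failure occurred.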