Stephan Ewen created FLINK-6666:
-----------------------------------

             Summary: RestartStrategy should differentiate between types of 
recovery (global / local / resource missing)
                 Key: FLINK-6666
                 URL: https://issues.apache.org/jira/browse/FLINK-6666
             Project: Flink
          Issue Type: Sub-task
          Components: Distributed Coordination
    Affects Versions: 1.3.0
            Reporter: Stephan Ewen


Currently, the {{RestartStrategy}} has a single method that is called when a 
failure requires an ExecutionGraph restart.

With the new addition of incremental recovery, it is desirable to distinguish 
between the types of failover that can happen.

I suggest extending the {{RestartStrategy}} to support three 
cases/methods:

  - {{restartGlobal()}} for a full restart recovery
  - {{restartLocal()}} for a recovery coordinated by the {{FailoverStrategy}}
  - {{restartOnMissingResources()}} if the failure cause was missing slots
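
A minimal sketch of what the three entry points could look like. It does not 
reproduce Flink's actual {{RestartStrategy}} signatures: the interface name 
{{TypedRestartStrategy}}, the {{Runnable}} parameter and the recording 
implementation are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical interface with one method per failover type.
 * The Runnable stands in for whatever action actually performs the restart.
 */
interface TypedRestartStrategy {
    /** Full restart of the whole ExecutionGraph. */
    void restartGlobal(Runnable restarter);

    /** Partial restart coordinated by the FailoverStrategy. */
    void restartLocal(Runnable restarter);

    /** Restart after a failure caused by missing slots. */
    void restartOnMissingResources(Runnable restarter);
}

/** Illustrative implementation: records the restart type, then restarts immediately. */
class RecordingRestartStrategy implements TypedRestartStrategy {
    final List<String> calls = new ArrayList<>();

    @Override
    public void restartGlobal(Runnable restarter) {
        calls.add("global");
        restarter.run();
    }

    @Override
    public void restartLocal(Runnable restarter) {
        calls.add("local");
        restarter.run();
    }

    @Override
    public void restartOnMissingResources(Runnable restarter) {
        calls.add("missing-resources");
        restarter.run();
    }
}
```

With separate methods, an implementation can apply a different policy (delay, 
attempt counting) to each failover type without inspecting the failure cause 
itself.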

The last case is interesting, in my opinion, because it is commonly desirable 
that regular failover has no delay, while failover on missing resources has a 
short delay (1s or so) to avoid very fast cycles of restart attempts. In 
standalone mode, there can easily be 100,000 restarts within a second when no 
resources are available and no delay happens between restarts.
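
The delay policy described above could be sketched as follows. The names 
{{RestartKind}} and {{DelayPolicy}} are hypothetical and not part of Flink; 
the 1000 ms value in the usage note is only the "1s or so" from the text.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical failover types, mirroring the three proposed methods. */
enum RestartKind { GLOBAL, LOCAL, MISSING_RESOURCES }

/** Hypothetical policy: regular failover uses one delay, missing resources another. */
class DelayPolicy {
    private final long regularDelayMillis;
    private final long missingResourcesDelayMillis;

    DelayPolicy(long regularDelayMillis, long missingResourcesDelayMillis) {
        this.regularDelayMillis = regularDelayMillis;
        this.missingResourcesDelayMillis = missingResourcesDelayMillis;
    }

    /** Returns the delay to wait before restarting for the given failover type. */
    long delayMillisFor(RestartKind kind) {
        return kind == RestartKind.MISSING_RESOURCES
                ? missingResourcesDelayMillis
                : regularDelayMillis;
    }

    /** Schedules the restart action on the given executor after the matching delay. */
    void schedule(RestartKind kind, Runnable restart, ScheduledExecutorService executor) {
        executor.schedule(restart, delayMillisFor(kind), TimeUnit.MILLISECONDS);
    }
}
```

For example, {{new DelayPolicy(0, 1000)}} restarts immediately on global and 
local failover but waits one second when slots are missing, which bounds the 
retry rate to roughly one attempt per second instead of a tight loop.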




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
