[ 
https://issues.apache.org/jira/browse/FLINK-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063855#comment-15063855
 ] 

ASF GitHub Bot commented on FLINK-3184:
---------------------------------------

Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/1468#issuecomment-165754886
  
    The idea is to decouple the restart logic from the `JobManager` and to make 
it configurable on a per-job basis. Different strategies are conceivable, for 
instance the fixed-delay restart strategy that we have right now. Additions 
could be an exponential backoff restart strategy or, later, a scale in/out 
restart strategy. Furthermore, this allows the delays to be set on a per-job 
basis, which might be relevant for specific SLAs.
    
    But in general it's more like a preliminary step towards the scale in/out 
restart strategy, I guess.
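
    To make the idea concrete, here is a rough sketch of what a pluggable, 
per-job restart strategy could look like. All names in it (`RestartStrategy`, 
`FixedDelayRestartStrategy`, `ExponentialBackoffRestartStrategy`, 
`canRestart`, `nextDelayMillis`) are hypothetical and chosen only for 
illustration; this is not the interface from the pull request.

```java
/** Hypothetical per-job restart policy; illustrative only, not Flink's API. */
interface RestartStrategy {
    /** Returns true while another restart attempt is still allowed. */
    boolean canRestart();

    /** Records one restart attempt and returns the delay (ms) to wait before it. */
    long nextDelayMillis();
}

/** Waits the same delay before every attempt, up to maxAttempts attempts. */
final class FixedDelayRestartStrategy implements RestartStrategy {
    private final int maxAttempts;
    private final long delayMillis;
    private int attempts = 0;

    FixedDelayRestartStrategy(int maxAttempts, long delayMillis) {
        this.maxAttempts = maxAttempts;
        this.delayMillis = delayMillis;
    }

    @Override
    public boolean canRestart() {
        return attempts < maxAttempts;
    }

    @Override
    public long nextDelayMillis() {
        if (!canRestart()) {
            throw new IllegalStateException("restart attempts exhausted");
        }
        attempts++;
        return delayMillis;
    }
}

/** Doubles the delay after every attempt, capped at maxDelayMillis. */
final class ExponentialBackoffRestartStrategy implements RestartStrategy {
    private final int maxAttempts;
    private final long initialDelayMillis;
    private final long maxDelayMillis;
    private int attempts = 0;

    ExponentialBackoffRestartStrategy(
            int maxAttempts, long initialDelayMillis, long maxDelayMillis) {
        this.maxAttempts = maxAttempts;
        this.initialDelayMillis = initialDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    @Override
    public boolean canRestart() {
        return attempts < maxAttempts;
    }

    @Override
    public long nextDelayMillis() {
        if (!canRestart()) {
            throw new IllegalStateException("restart attempts exhausted");
        }
        // delay = initial * 2^attempts; stop doubling once the cap is reached
        long delay = initialDelayMillis;
        for (int i = 0; i < attempts && delay < maxDelayMillis; i++) {
            delay *= 2;
        }
        attempts++;
        return Math.min(delay, maxDelayMillis);
    }
}
```

    With something like this, the component that restarts a job would only 
ask the job's strategy whether a restart is allowed and how long to wait, so 
each job could carry its own delay settings.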


> Decrease Akka timeouts on cluster side to make system more responsive
> ---------------------------------------------------------------------
>
>                 Key: FLINK-3184
>                 URL: https://issues.apache.org/jira/browse/FLINK-3184
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Minor
>
> Currently, the default timeout for futures is set to 100 s. This is also the 
> time used to wait in between restart attempts if no other value has been 
> explicitly specified. Especially in the streaming case, it is often necessary 
> to detect and react to failures in a shorter period than 100 s. 
> Therefore, I propose to decrease the default timeout to 10 s.
> Additionally, I propose to introduce a slightly higher timeout for the client 
> side (e.g. 60 s). The reason is that in case of a {{JobManager}} failure the 
> client has to wait until the cluster has recovered. Recovery via ZooKeeper 
> can take longer than 10 s. In such a case, a recovery could be falsely 
> recognized as a lost connection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
