[
https://issues.apache.org/jira/browse/FLINK-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063855#comment-15063855
]
ASF GitHub Bot commented on FLINK-3184:
---------------------------------------
Github user tillrohrmann commented on the pull request:
https://github.com/apache/flink/pull/1468#issuecomment-165754886
The idea is to decouple the restart logic from the `JobManager` and to make
it configurable on a per-job basis. Different strategies are conceivable, for
instance the fixed delay restart strategy we have right now. Additions
could be an exponential backoff restart strategy or, later, a scale in/out
restart strategy. Furthermore, this allows setting the delays on a per-job basis,
which might be relevant for specific SLAs.
But in general it's more like a preliminary step towards the scale in/out
restart strategy, I guess.
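
To illustrate the decoupling described above, here is a minimal sketch of what a pluggable restart strategy could look like. All names (`RestartStrategy`, `FixedDelayRestartStrategy`, `ExponentialBackoffRestartStrategy`, `nextDelayMillis`) are hypothetical and chosen for illustration only; they are not taken from the pull request and may differ from what Flink actually uses.

```java
import java.util.OptionalLong;

/** Hypothetical interface: decides whether and when a failed job is restarted. */
interface RestartStrategy {
    /**
     * @param attempt 1-based number of the restart attempt about to be made
     * @return delay in milliseconds before that attempt, or empty to give up
     */
    OptionalLong nextDelayMillis(int attempt);
}

/** Waits the same amount of time before every restart attempt. */
final class FixedDelayRestartStrategy implements RestartStrategy {
    private final int maxAttempts;
    private final long delayMillis;

    FixedDelayRestartStrategy(int maxAttempts, long delayMillis) {
        this.maxAttempts = maxAttempts;
        this.delayMillis = delayMillis;
    }

    @Override
    public OptionalLong nextDelayMillis(int attempt) {
        return (attempt >= 1 && attempt <= maxAttempts)
                ? OptionalLong.of(delayMillis)
                : OptionalLong.empty();
    }
}

/** Doubles the delay after each attempt, up to a configured maximum. */
final class ExponentialBackoffRestartStrategy implements RestartStrategy {
    private final int maxAttempts;
    private final long initialDelayMillis;
    private final long maxDelayMillis;

    ExponentialBackoffRestartStrategy(int maxAttempts, long initialDelayMillis, long maxDelayMillis) {
        this.maxAttempts = maxAttempts;
        this.initialDelayMillis = initialDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    @Override
    public OptionalLong nextDelayMillis(int attempt) {
        if (attempt < 1 || attempt > maxAttempts) {
            return OptionalLong.empty();
        }
        long delay = Math.min(initialDelayMillis, maxDelayMillis);
        // Double once per previous attempt; clamp to the maximum without overflowing.
        for (int i = 1; i < attempt && delay < maxDelayMillis; i++) {
            delay = (delay > maxDelayMillis / 2) ? maxDelayMillis : delay * 2;
        }
        return OptionalLong.of(delay);
    }
}
```

With this shape, each job could carry its own strategy instance (for example a fixed 10 s delay for one job and a capped backoff for another), and the component that restarts jobs would only ask the strategy for the next delay instead of hard-coding it.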
> Decrease Akka timeouts on cluster side to make system more responsive
> ---------------------------------------------------------------------
>
> Key: FLINK-3184
> URL: https://issues.apache.org/jira/browse/FLINK-3184
> Project: Flink
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
> Priority: Minor
>
> Currently, the default timeout for futures is set to 100 s. This is also the
> time used to wait in between restart attempts if no other value has been
> explicitly specified. Especially in the streaming case, it is often necessary
> to detect failures and to react to failures in a shorter period than 100 s.
> Therefore, I propose to decrease the default timeout to 10 s.
> Additionally, I propose to introduce a slightly higher timeout for the client
> side (e.g. 60 s). The reason is that in case of a {{JobManager}} failure the client
> has to wait until the cluster has recovered. Using ZooKeeper for that can
> entail a longer timeout than 10 s. In such a case a recovery could be falsely
> recognized as a lost connection.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)