[
https://issues.apache.org/jira/browse/FLINK-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15101594#comment-15101594
]
ASF GitHub Bot commented on FLINK-3184:
---------------------------------------
Github user tillrohrmann commented on the pull request:
https://github.com/apache/flink/pull/1468#issuecomment-171928797
We could set the default execution retry delay to 0, assuming that any
longer timeout in a streaming use case would render the job wrong anyway.
However, if a longer timeout is acceptable, we would lose the ability to
recover from a lost task manager, which usually takes some time to reconnect to
the cluster (given that all instances are in use).
I would be more in favour of having an exponential backoff strategy as the
default. This would give us quick recovery when enough resources are
available, while still leaving time to wait for a TM to reconnect. We
could implement such a restart strategy once PR #1470 is merged.
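To make the idea concrete, here is a minimal sketch of how the delay between restart attempts could grow. It is not Flink code and not the interface from PR #1470: the class name `ExponentialBackoff`, its parameters, and the values used below (100 ms initial delay, factor 2, 60 s cap) are assumptions chosen purely for illustration.

```java
/**
 * Hypothetical helper (not part of Flink): computes the delay to wait
 * before the n-th restart attempt, growing exponentially up to a cap.
 */
final class ExponentialBackoff {
    private final long initialDelayMs;
    private final long maxDelayMs;
    private final double multiplier;

    ExponentialBackoff(long initialDelayMs, long maxDelayMs, double multiplier) {
        if (initialDelayMs <= 0 || maxDelayMs < initialDelayMs || multiplier < 1.0) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        this.initialDelayMs = initialDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.multiplier = multiplier;
    }

    /** Returns the delay in milliseconds before the given 0-based restart attempt. */
    long delayMs(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        // initial * multiplier^attempt, capped at maxDelayMs.
        // Math.pow may overflow to infinity for large attempts; min() then yields the cap.
        double delay = initialDelayMs * Math.pow(multiplier, attempt);
        return (long) Math.min(delay, (double) maxDelayMs);
    }
}
```

With `new ExponentialBackoff(100, 60000, 2.0)`, the first restarts happen almost immediately (100 ms, 200 ms, 400 ms), which covers the case where spare resources exist, while later attempts wait up to 60 s, which leaves room for a TM to reconnect.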
What do you think?
> Decrease Akka timeouts on cluster side to make system more responsive
> ---------------------------------------------------------------------
>
> Key: FLINK-3184
> URL: https://issues.apache.org/jira/browse/FLINK-3184
> Project: Flink
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
> Priority: Minor
>
> Currently, the default timeout for futures is set to 100 s. This is also the
> time used to wait in between restart attempts if no other value has been
> explicitly specified. Especially in the streaming case, it is often necessary
> to detect and react to failures in a shorter period than 100 s.
> Therefore, I propose to decrease the default timeout to 10 s.
> Additionally, I propose to introduce a slightly higher timeout for the client
> side (e.g. 60 s). The reason is that in case of a {{JobManager}} failure the client
> has to wait until the cluster has recovered. Using ZooKeeper for that can
> entail a longer timeout than 10 s. In such a case a recovery could be falsely
> recognized as a lost connection.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)