[ 
https://issues.apache.org/jira/browse/FLINK-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15101594#comment-15101594
 ] 

ASF GitHub Bot commented on FLINK-3184:
---------------------------------------

Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/1468#issuecomment-171928797
  
    We could set the default execution retry delay to 0, assuming that any 
longer timeout in a streaming use case would invalidate the job's results 
anyway. However, if a longer timeout is acceptable, we would lose the ability 
to recover from a lost task manager, which usually takes some time to reconnect 
to the cluster (given that we use all instances).
    
    I would be more in favour of having an exponential backoff strategy as the 
default. This would give us a quick recovery when enough resources are 
available, but also the possibility to wait for a TM reconnection. We could 
implement such a restart strategy once PR #1470 is merged.
    
    What do you think?
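
To make the exponential backoff idea concrete, here is a minimal sketch of how the delay between restart attempts could be computed. It is not Flink code and does not reflect the restart strategy interface from PR #1470; the class name `ExponentialBackoffDelay`, its parameters, and the choice of an immediate first restart are all assumptions made for illustration.

```java
import java.time.Duration;

/**
 * Hypothetical helper (not part of Flink): computes how long to wait
 * before the n-th restart attempt using exponential backoff with a cap.
 */
final class ExponentialBackoffDelay {
    private final Duration initialDelay;
    private final Duration maxDelay;
    private final double multiplier;

    ExponentialBackoffDelay(Duration initialDelay, Duration maxDelay, double multiplier) {
        if (initialDelay.isNegative()
                || maxDelay.compareTo(initialDelay) < 0
                || multiplier < 1.0) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        this.initialDelay = initialDelay;
        this.maxDelay = maxDelay;
        this.multiplier = multiplier;
    }

    /**
     * Returns the delay before the given restart attempt (0-based).
     * Attempt 0 restarts immediately, which gives a quick recovery when
     * enough resources are still available. Later attempts wait
     * initialDelay * multiplier^(attempt - 1), capped at maxDelay, which
     * leaves time for a lost task manager to reconnect.
     */
    Duration delayBeforeAttempt(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be non-negative");
        }
        if (attempt == 0) {
            return Duration.ZERO;
        }
        double millis = initialDelay.toMillis() * Math.pow(multiplier, attempt - 1);
        // Also covers overflow to infinity for very large attempt numbers.
        if (millis >= maxDelay.toMillis()) {
            return maxDelay;
        }
        return Duration.ofMillis((long) millis);
    }
}
```

With an assumed initial delay of 1 s, a multiplier of 2, and a cap of 100 s, the first restart would happen immediately, the following ones after 1 s, 2 s, 4 s, and so on, until the cap is reached.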


> Decrease Akka timeouts on cluster side to make system more responsive
> ---------------------------------------------------------------------
>
>                 Key: FLINK-3184
>                 URL: https://issues.apache.org/jira/browse/FLINK-3184
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Minor
>
> Currently, the default timeout for futures is set to 100 s. This is also the 
> time used to wait in between restart attempts if no other value has been 
> explicitly specified. Especially in the streaming case, it is often necessary 
> to detect and react to failures in a shorter period than 100 s. 
> Therefore, I propose to decrease the default timeout to 10 s.
> Additionally, I propose to introduce a slightly higher timeout for the client 
> side (e.g. 60 s). The reason is that in case of a {{JobManager}} failure the 
> client has to wait until the cluster has recovered. Using ZooKeeper for that 
> can take longer than 10 s. With a 10 s client timeout, such a recovery could 
> be falsely recognized as a lost connection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
