Repository: reef
Updated Branches:
  refs/heads/master dd14f0910 -> 0d78bf822
[REEF-1680] Increasing default retry time in WaitingForRegistration

In IMRU, the update task doesn't need to load data, but the mappers do.
That means we need to set a sufficient retry time in WaitingForRegistration,
at least longer than the data loading time. The current setting is about 4 min,
but for big data, loading can take 25 minutes.

Previously we hesitated to increase this number because WaitingForRegistration
happened in the Task constructor. Before we get the IRunningTask event, there is
no way to cancel a long-waiting task in failure cases.

Now we have moved WaitingForRegistration out of the task constructor and added
a cancellation token so that we can cancel the waiting if a failure happens.

This change increases the waiting time to 30 min.

JIRA:
  [REEF-1680](https://issues.apache.org/jira/browse/REEF-1680)

Pull request:
  This closes #1194

Project: http://git-wip-us.apache.org/repos/asf/reef/repo
Commit: http://git-wip-us.apache.org/repos/asf/reef/commit/0d78bf82
Tree: http://git-wip-us.apache.org/repos/asf/reef/tree/0d78bf82
Diff: http://git-wip-us.apache.org/repos/asf/reef/diff/0d78bf82

Branch: refs/heads/master
Commit: 0d78bf8220a012c5d1b83db930ed431cfd521572
Parents: dd14f09
Author: Julia Wang <[email protected]>
Authored: Tue Nov 29 19:20:04 2016 -0800
Committer: Mariia Mykhailova <[email protected]>
Committed: Fri Dec 2 16:33:35 2016 -0800

----------------------------------------------------------------------
 .../Group/Config/GroupCommConfigurationOptions.cs | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/reef/blob/0d78bf82/lang/cs/Org.Apache.REEF.Network/Group/Config/GroupCommConfigurationOptions.cs
----------------------------------------------------------------------
diff --git a/lang/cs/Org.Apache.REEF.Network/Group/Config/GroupCommConfigurationOptions.cs b/lang/cs/Org.Apache.REEF.Network/Group/Config/GroupCommConfigurationOptions.cs
index da645a2..51463d3 100644
--- a/lang/cs/Org.Apache.REEF.Network/Group/Config/GroupCommConfigurationOptions.cs
+++ b/lang/cs/Org.Apache.REEF.Network/Group/Config/GroupCommConfigurationOptions.cs
@@ -46,7 +46,7 @@ namespace Org.Apache.REEF.Network.Group.Config
     /// Each Communication group needs to check and wait until all the other nodes in the group are registered to the NameServer
     /// Sleep time is set between each retry.
     /// </summary>
-    [NamedParameter("sleep time to wait for nodes to be registered", defaultValue: "500")]
+    [NamedParameter("sleep time to wait for nodes to be registered", defaultValue: "2000")]
     internal sealed class SleepTimeWaitingForRegistration : Name<int>
     {
     }
@@ -55,13 +55,11 @@ namespace Org.Apache.REEF.Network.Group.Config
     /// Each Communication group needs to check and wait until all the other nodes in the group are registered to the NameServer
     /// </summary>
     /// <remarks>
-    /// When there are many nodes, e.g over 100, the waiting time might be pretty long.
-    /// We don't want to set it too low in case some nodes are just slow, if we simply throw an exception, that is not right.
-    /// We don't want it to try endlessly in case some node is really dead, we should come out with exception.
-    /// We want it to return as soon as all nodes in the group are registered, So increasing retry count is better than increasing sleep time.
-    /// Current default sleep time is 500ms. Default retry is 500. Total is 250000ms, that is 250s, little bit more than 4 min
+    /// If a node is waiting for others that need to download data, the waiting time could be long.
+    /// As we can use cancellation token to cancel the waiting for registration, setting this number higher should be OK.
+    /// Current default sleep time is 2000ms. Default retry is 900. Total is 1800s, that is 30 min.
     /// </remarks>
-    [NamedParameter("Retry times to wait for nodes to be registered", defaultValue: "500")]
+    [NamedParameter("Retry times to wait for nodes to be registered", defaultValue: "900")]
     internal sealed class RetryCountWaitingForRegistration : Name<int>
     {
     }
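
For readers who want to check the numbers in the commit message, the following is a minimal sketch of a cancellable retry loop and the total wait it implies. It is not the REEF implementation: the class `RegistrationWaitSketch`, the methods `MaxWait` and `WaitForRegistration`, and the `isRegistered` callback are all hypothetical names introduced here for illustration. Only the default values (500 ms x 500 retries before, 2000 ms x 900 retries after) come from the diff above.

```csharp
using System;
using System.Threading;

// Hypothetical sketch; none of these names exist in REEF.
public static class RegistrationWaitSketch
{
    // New defaults, taken from the diff above.
    public const int SleepTimeMs = 2000;
    public const int RetryCount = 900;

    // Upper bound on the total wait, ignoring the time each check itself takes.
    public static TimeSpan MaxWait(int sleepTimeMs, int retryCount)
    {
        return TimeSpan.FromMilliseconds((double)sleepTimeMs * retryCount);
    }

    // Polls isRegistered up to retryCount times, sleeping sleepTimeMs between attempts.
    // Returns true as soon as isRegistered reports success, false when retries run out.
    // Throws OperationCanceledException if the token is cancelled before or during a sleep.
    public static bool WaitForRegistration(
        Func<bool> isRegistered, int sleepTimeMs, int retryCount, CancellationToken token)
    {
        for (int attempt = 0; attempt < retryCount; attempt++)
        {
            token.ThrowIfCancellationRequested();
            if (isRegistered())
            {
                return true;
            }

            // WaitOne returns true if the token was signalled during the sleep,
            // so a cancellation interrupts the wait instead of running out the timer.
            if (token.WaitHandle.WaitOne(sleepTimeMs))
            {
                throw new OperationCanceledException(token);
            }
        }

        return false;
    }

    public static void Main()
    {
        // Old defaults: 500 ms * 500 retries = 250 s.
        Console.WriteLine(MaxWait(500, 500).TotalSeconds);
        // New defaults: 2000 ms * 900 retries = 1800 s, i.e. 30 min.
        Console.WriteLine(MaxWait(SleepTimeMs, RetryCount).TotalMinutes);
    }
}
```

In this sketch the loop returns as soon as the check succeeds, so a larger retry budget only costs time when nodes are genuinely slow or dead, and in the dead-node case the token lets the caller cut the wait short.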
