[ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541755 ]
Arun C Murthy commented on HADOOP-1900:
---------------------------------------

After more thought, some observations:

1. Heartbeat Interval Ranges

I believe an initial heartbeat interval of (clustersize/50) is too aggressive for small clusters, e.g. it leads to 1s for a 50-node cluster, 2s for 100 nodes, etc. I say this with care, since there isn't much that tasks can accomplish in a 2-3 second interval. Instead, speaking from experience, I'd like to see the chosen algorithm achieve the following intervals for the given cluster sizes:

|| Cluster Size || Heartbeat Interval (in secs) ||
| < 100 | 5s |
| 100-500 | 5s-10s |
| 500-1000 | 10s-15s |
| 1000-1500 | 15s-20s |
| 1500-2000 | 20+ s |

These numbers are in line with observed performance on real-world clusters, and also keep in mind that with any interval < 5s there is probably not going to be much to update.

2. Dynamic Scaling of Heartbeat Intervals

I propose we model the back-off strategy loosely on TCP's _slow start_, i.e. put reliability above performance.

When we notice a significant number of dropped RPCs, the first thing is to ensure that it doesn't occur again. Keeping that in mind, I propose we double the current heartbeat interval (up to the above limits, section 1), and keep doubling till we see no more dropped calls. Once we achieve that reliability goal, I propose we decrease the heartbeat interval slowly (say by 1s at a time) till we settle on a stable interval, i.e. one with no more dropped calls.

E.g. a cluster size of 100 nodes:

|| Time || Noticed Behaviour || Reaction on Heartbeat Interval ||
| t0 | | 5s |
| t1 | dropped calls (say 10% of cluster-size, i.e. 10 dropped calls) | increase to 10s |
| t2 | no more dropped calls | decrease to 9s |
| t3 | no more dropped calls | decrease to 8s |
| t4 | no more dropped calls | decrease to 7s |
| t5 | dropped calls | increase to 8s |
| t6 | no more dropped calls | stabilize at 8s |

Thoughts?

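To make the back-off proposal in section 2 concrete, here is a minimal sketch in Java. It is not taken from any attached patch: the class name HeartbeatBackoff, the report() method, and the three-phase bookkeeping are all hypothetical. It assumes the caller decides what counts as a "significant" number of dropped calls (e.g. 10% of cluster size) and passes in the per-cluster limits from section 1.

```java
/**
 * Hypothetical sketch of the proposed back-off: double on dropped calls,
 * then creep back down 1s at a time, and undo the last step if drops return.
 * Not Hadoop code; all names here are illustrative assumptions.
 */
class HeartbeatBackoff {
    private enum Phase { STABLE, BACKED_OFF, DECREASING }

    private static final int STEP_SECS = 1; // "say by 1s at a time"

    private final int minSecs; // lower limit for this cluster size (section 1)
    private final int maxSecs; // upper limit for this cluster size (section 1)
    private int intervalSecs;
    private Phase phase = Phase.STABLE;

    HeartbeatBackoff(int minSecs, int maxSecs) {
        if (minSecs <= 0 || maxSecs < minSecs) {
            throw new IllegalArgumentException("need 0 < minSecs <= maxSecs");
        }
        this.minSecs = minSecs;
        this.maxSecs = maxSecs;
        this.intervalSecs = minSecs; // start at the lower limit
    }

    int getIntervalSecs() {
        return intervalSecs;
    }

    /**
     * Feed one observation period into the controller.
     *
     * @param sawDroppedCalls true if the caller judged the number of dropped
     *                        RPCs in this period to be significant
     * @return the heartbeat interval (secs) to use for the next period
     */
    int report(boolean sawDroppedCalls) {
        if (sawDroppedCalls) {
            if (phase == Phase.DECREASING) {
                // The last decrease went too far: undo it and hold there.
                intervalSecs = Math.min(maxSecs, intervalSecs + STEP_SECS);
                phase = Phase.STABLE;
            } else {
                // Reliability first: double, capped at the upper limit.
                intervalSecs = Math.min(maxSecs, intervalSecs * 2);
                phase = Phase.BACKED_OFF;
            }
        } else if (phase != Phase.STABLE) {
            if (intervalSecs > minSecs) {
                // No drops: probe downward slowly.
                intervalSecs = Math.max(minSecs, intervalSecs - STEP_SECS);
                phase = Phase.DECREASING;
            } else {
                // Reached the lower limit without drops: nothing left to probe.
                phase = Phase.STABLE;
            }
        }
        return intervalSecs;
    }
}
```

With limits of 5s and 10s (the 100-500 row), feeding in the observations from the example table (drop, none, none, none, drop, none) walks the interval through 10s, 9s, 8s, 7s, 8s, 8s. One design choice worth debating: once stable, a fresh burst of dropped calls starts a new doubling round rather than a 1s bump.
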
> the heartbeat and task event queries interval should be set dynamically by
> the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>         Attachments: patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to
> contact it dynamically, based on how busy it is and the size of the
> cluster.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.