[
https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541755
]
Arun C Murthy commented on HADOOP-1900:
---------------------------------------
After more thought, some observations:
1. Heartbeat Interval Ranges
I believe an initial heartbeat interval of (clustersize/50) is too aggressive for
small clusters, e.g. it leads to 1s for a 50-node cluster, 2s for 100 nodes, etc. I
say this deliberately, since there isn't much that tasks can accomplish in a
2-3 second interval. Instead, speaking from experience, I'd like to see the chosen
algorithm achieve the following intervals for the given cluster sizes:
|| Cluster Size (nodes) || Heartbeat Interval (in secs) ||
| < 100 | 5s |
| 100-500 | 5s-10s |
| 500-1000 | 10s-15s |
| 1000-1500 | 15s-20s |
| 1500-2000 | 20s+ |
These numbers are in line with observed performance on real-world clusters.
They also reflect that a heartbeat sent at any interval <5s is unlikely to
carry much new status.
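To make the target ranges concrete, here is a minimal Java sketch that maps a cluster size to an interval. The class and method names are hypothetical (they are not from the patch), and the linear interpolation inside each band, as well as holding at 20s above 1500 nodes, are my assumptions; the table only gives the band end-points.

```java
/** Hypothetical helper, not part of any attached patch. */
final class HeartbeatIntervalTable {
    // Band edges (cluster size) and the interval in seconds at each edge,
    // taken from the table above.
    private static final int[] SIZES = {100, 500, 1000, 1500};
    private static final int[] SECS  = {5, 10, 15, 20};

    private HeartbeatIntervalTable() {}

    /** Returns a target heartbeat interval in seconds for the given cluster size. */
    static int targetIntervalSecs(int clusterSize) {
        if (clusterSize < 0) {
            throw new IllegalArgumentException("clusterSize must be >= 0");
        }
        // Small clusters never go below the 5s floor.
        if (clusterSize <= SIZES[0]) {
            return SECS[0];
        }
        for (int i = 1; i < SIZES.length; i++) {
            if (clusterSize <= SIZES[i]) {
                // Assumption: interpolate linearly between the band edges.
                int span = SIZES[i] - SIZES[i - 1];
                int rise = SECS[i] - SECS[i - 1];
                return SECS[i - 1] + (clusterSize - SIZES[i - 1]) * rise / span;
            }
        }
        // Assumption: "20s+" is treated as a 20s value for the largest clusters.
        return SECS[SECS.length - 1];
    }
}
```

For instance, `HeartbeatIntervalTable.targetIntervalSecs(750)` falls in the 500-1000 band and yields a value between 10s and 15s.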
2. Dynamic Scaling of Heartbeat Intervals
I propose we model the back-off strategy loosely on TCP's _slow start_, i.e.
put reliability above performance. When we notice a significant number of
dropped RPCs, the first priority is to ensure that it doesn't occur again. Keeping
that in mind, I propose we double the current heartbeat interval (up to the
limits in section 1 above), and keep doubling until we see no more dropped calls.
Once we achieve that reliability goal, I propose we decrease the heartbeat
interval slowly (say by 1s at a time) until we find the lowest interval that is
stable, i.e. one with no more dropped calls.
E.g., for a cluster size of 100 nodes:
|| Time || Noticed Behaviour || Reaction on Heartbeat Interval ||
| t0 | | 5s |
| t1 | dropped calls (say 10% of cluster-size i.e. 10 dropped calls) | increase to 10s |
| t2 | no more dropped calls | decrease to 9s |
| t3 | no more dropped calls | decrease to 8s |
| t4 | no more dropped calls | decrease to 7s |
| t5 | dropped calls | increase to 8s |
| t6 | no more dropped calls | stabilize at 8s |
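The behaviour in the table can be sketched as a small state machine. This is only an illustration under stated assumptions: the class, enum, and method names are hypothetical, the caller is assumed to decide what counts as a "significant" number of dropped calls (10% of cluster size in the example), and restarting the doubling if drops reappear after stabilizing is my assumption, since the proposal doesn't cover that case.

```java
/** Hypothetical controller, not part of any attached patch. */
final class HeartbeatBackoff {
    enum Phase { STEADY, BACKING_OFF, PROBING }

    private final int minSecs;
    private final int maxSecs;
    private int intervalSecs;
    private Phase phase = Phase.STEADY;

    HeartbeatBackoff(int minSecs, int maxSecs) {
        if (minSecs <= 0 || maxSecs < minSecs) {
            throw new IllegalArgumentException("need 0 < minSecs <= maxSecs");
        }
        this.minSecs = minSecs;
        this.maxSecs = maxSecs;
        this.intervalSecs = minSecs;
    }

    /**
     * Feeds one observation period into the controller.
     * @param droppedCalls true if a significant number of RPCs were dropped
     * @return the heartbeat interval (in seconds) to use next
     */
    int onObservation(boolean droppedCalls) {
        switch (phase) {
            case STEADY:
                // Assumption: drops while steady (re)start the doubling.
                if (droppedCalls) {
                    doubleInterval();
                    phase = Phase.BACKING_OFF;
                }
                break;
            case BACKING_OFF:
                if (droppedCalls) {
                    doubleInterval();          // keep doubling until calls stop dropping
                } else {
                    stepDown();                // reliability reached: start creeping down
                    phase = Phase.PROBING;
                }
                break;
            case PROBING:
                if (droppedCalls) {
                    // Went one step too low: go back up by 1s and hold there.
                    intervalSecs = Math.min(maxSecs, intervalSecs + 1);
                    phase = Phase.STEADY;
                } else if (intervalSecs > minSecs) {
                    stepDown();
                } else {
                    phase = Phase.STEADY;      // reached the floor without drops
                }
                break;
        }
        return intervalSecs;
    }

    int currentIntervalSecs() {
        return intervalSecs;
    }

    private void doubleInterval() {
        intervalSecs = Math.min(maxSecs, intervalSecs * 2);
    }

    private void stepDown() {
        intervalSecs = Math.max(minSecs, intervalSecs - 1);
    }
}
```

Constructed with a 5s floor and, say, a 20s cap, feeding it the observations from the table (drops, three clean periods, drops, clean) walks the interval through 10s, 9s, 8s, 7s, 8s and then holds at 8s.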
Thoughts?
> the heartbeat and task event queries interval should be set dynamically by
> the JobTracker
> -----------------------------------------------------------------------------------------
>
> Key: HADOOP-1900
> URL: https://issues.apache.org/jira/browse/HADOOP-1900
> Project: Hadoop
> Issue Type: Improvement
> Components: mapred
> Reporter: Owen O'Malley
> Assignee: Amareshwari Sri Ramadasu
> Attachments: patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to
> contact it dynamically, based on how busy it is and the size of the
> cluster.