[ 
https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541755
 ] 

Arun C Murthy commented on HADOOP-1900:
---------------------------------------

After more thought, some observations:

1. Heartbeat Interval Ranges

I believe an initial heartbeat interval of (clustersize/50) seconds is too 
aggressive for small clusters, e.g. it leads to 1s for a 50-node cluster, 2s 
for 100 nodes, etc. I state this with care, since there isn't much that tasks 
can accomplish in a 2-3 second interval. Instead, speaking from experience, 
I'd like to see the chosen algorithm achieve the following intervals for the 
given cluster sizes:

|| Cluster Size (nodes) || Heartbeat Interval ||
| < 100 | 5s |
| 100-500 | 5s-10s |
| 500-1000 | 10s-15s |
| 1000-1500 | 15s-20s |
| 1500-2000 | 20s+ |

These numbers are in line with observed performance on real-world clusters, 
and also keep in mind that tasks probably won't have much to report in any 
interval shorter than 5s.
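
As a rough illustration of the table above, here is a minimal sketch that maps 
a cluster size to a heartbeat interval. The class and method names are 
hypothetical and are not taken from the attached patch. The table only gives 
ranges, so interpolating linearly inside each range, and continuing at 1s per 
100 nodes beyond 1500 nodes, are assumptions made purely for this example.

```java
// Hypothetical helper, not part of the HADOOP-1900 patch.
class HeartbeatIntervalTable {

    // Returns a heartbeat interval in milliseconds for the given cluster size.
    // Assumption: linear interpolation between the table's range endpoints.
    static long intervalMillis(int clusterSize) {
        if (clusterSize < 100) {
            return 5_000L;                                   // < 100 nodes: 5s
        }
        if (clusterSize < 500) {
            // 100-500 nodes: 5s rising to 10s
            return 5_000L + (clusterSize - 100) * 5_000L / 400;
        }
        // 500-1000: 10s-15s, 1000-1500: 15s-20s, i.e. 1s per 100 nodes.
        // Assumption: the same slope continues past 1500 nodes ("20s+").
        return 10_000L + (clusterSize - 500) * 1_000L / 100;
    }

    public static void main(String[] args) {
        int[] sizes = {50, 100, 300, 500, 1000, 1500, 2000};
        for (int n : sizes) {
            // For comparison, clustersize/50 seconds is the interval discussed above.
            System.out.println(n + " nodes: " + intervalMillis(n) + " ms"
                + " (clustersize/50 gives " + (n * 1_000L / 50) + " ms)");
        }
    }
}
```

The comparison printed by main shows the gap for small clusters: at 50 and 100 
nodes clustersize/50 gives 1s and 2s, while the table asks for 5s.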

2. Dynamic Scaling of Heartbeat Intervals

I propose we model the back-off strategy loosely on TCP's _slow start_, i.e. 
put reliability above performance. When we notice a significant number of 
dropped RPCs, the first priority is to ensure that it doesn't occur again. 
Keeping that in mind, I propose we double the current heartbeat interval (up 
to the limits in section 1 above), and keep doubling until we see no more 
dropped calls. Once we achieve that reliability goal, I propose we decrease 
the heartbeat interval slowly (say by 1s at a time) until dropped calls 
reappear, then step back up by 1s and stay there, i.e. stabilize at the 
smallest interval that shows no dropped calls.

E.g., for a cluster size of 100 nodes:

|| Time || Noticed Behaviour || Reaction on Heartbeat Interval ||
| t0 | | 5s |
| t1 | dropped calls (say 10% of cluster-size i.e. 10 dropped calls) | increase to 10s |
| t2 | no more dropped calls | decrease to 9s |
| t3 | no more dropped calls | decrease to 8s |
| t4 | no more dropped calls | decrease to 7s |
| t5 | dropped calls | increase to 8s |
| t6 | no more dropped calls | stabilize at 8s |
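
The sequence in the table can be sketched as a small state machine. This is 
only one possible reading of the proposal, not the patch's implementation: the 
class name, the three phases, and the boolean "dropped calls" input are all 
hypothetical, and deciding what counts as a significant number of dropped 
calls (10% of the cluster size in the example) is left to the caller. It also 
assumes that dropped calls seen after stabilizing start a new round of 
doubling.

```java
// Hypothetical sketch of the proposed back-off, not the HADOOP-1900 patch.
class HeartbeatBackoff {

    // STEADY: holding the interval; BACKING_OFF: doubling; PROBING: stepping down.
    enum Phase { STEADY, BACKING_OFF, PROBING }

    private static final long STEP_MILLIS = 1_000L; // "say by 1s at a time"

    private final long minMillis; // starting interval, also the floor
    private final long maxMillis; // upper limit, e.g. from the section 1 table
    private long intervalMillis;
    private Phase phase = Phase.STEADY;

    HeartbeatBackoff(long minMillis, long maxMillis) {
        this.minMillis = minMillis;
        this.maxMillis = maxMillis;
        this.intervalMillis = minMillis;
    }

    // Feeds one observation period and returns the new interval in milliseconds.
    // droppedCalls: true if a significant number of RPCs were dropped.
    long update(boolean droppedCalls) {
        if (droppedCalls) {
            if (phase == Phase.PROBING) {
                // Stepped down too far: go back up one step and hold.
                intervalMillis = Math.min(maxMillis, intervalMillis + STEP_MILLIS);
                phase = Phase.STEADY;
            } else {
                // Reliability first: double, capped at the upper limit.
                intervalMillis = Math.min(maxMillis, intervalMillis * 2);
                phase = Phase.BACKING_OFF;
            }
        } else if (phase != Phase.STEADY) {
            // No drops after a back-off: decrease slowly, never below the floor.
            intervalMillis = Math.max(minMillis, intervalMillis - STEP_MILLIS);
            phase = (intervalMillis == minMillis) ? Phase.STEADY : Phase.PROBING;
        }
        return intervalMillis;
    }

    public static void main(String[] args) {
        // Replays the 100-node example; the 20s ceiling is an arbitrary choice.
        HeartbeatBackoff backoff = new HeartbeatBackoff(5_000L, 20_000L);
        boolean[] observations = {true, false, false, false, true, false};
        for (boolean dropped : observations) {
            System.out.println((dropped ? "dropped calls" : "no dropped calls")
                + " -> " + backoff.update(dropped) + " ms");
        }
    }
}
```

Keeping a separate PROBING phase is what lets a drop during the slow decrease 
cost only a 1s step up, rather than another doubling.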


Thoughts?

> the heartbeat and task event queries interval should be set dynamically by 
> the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>         Attachments: patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to 
> contact it dynamically, based on how busy it is and the size of the 
> cluster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
