[
https://issues.apache.org/jira/browse/MAPREDUCE-6224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zhihai xu updated MAPREDUCE-6224:
---------------------------------
Status: Patch Available (was: Open)
> resolve the hosts in DNSToSwitchMapping before inter tracker server start to
> avoid IPC timeout in Task Tracker heartbeat
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-6224
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6224
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: mrv1
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: MAPREDUCE-6224.branch-1.000.patch
>
>
> Resolve the hosts to fill the cache in CachedDNSToSwitchMapping before the
> inter-tracker server starts, to avoid IPC timeouts in Task Tracker heartbeats.
> We saw IPC timeouts in Task Tracker heartbeats on a large MR1 cluster
> that uses a topology script (run through ShellCommandExecutor) to resolve the
> network topology for Task Tracker hosts in ScriptBasedMapping.
> The reason is as follows.
> Right after the inter-tracker server starts in the Job Tracker, the Job
> Tracker receives a lot of heartbeats from the Task Trackers.
> The heartbeat function calls resolveAndAddToTopology to resolve the network
> topology for the Task Tracker host in ScriptBasedMapping, which extends
> CachedDNSToSwitchMapping.
> ScriptBasedMapping#resolve checks whether the host is in the cache.
> If the host is not in the cache, it runs the topology script to get the
> host's network topology using ShellCommandExecutor. Running the topology
> script is normally time-consuming, which may cause IPC timeouts if too many
> heartbeats arrive at the same time on a large MR1 cluster.
> The solution is to resolve the network topology for all hosts in the hosts
> list from HostsFileReader before receiving any heartbeat from a Task Tracker,
> so that the cache in ScriptBasedMapping is filled. When a heartbeat then calls
> resolveAndAddToTopology, it gets the result from the cache instead of
> running the topology script.
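
The startup order described above can be illustrated with a minimal,
self-contained sketch. It is not the attached patch and does not use the Hadoop
classes: CachedRackResolver, PrewarmDemo, the lambda standing in for the
topology script, and the example host names are all hypothetical. It only
shows the idea that resolving the known hosts once, before heartbeats are
accepted, leaves nothing for the heartbeat path to look up the slow way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical stand-in for a cached host-to-rack mapping; not the Hadoop class. */
class CachedRackResolver {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Stands in for the expensive topology script invocation (assumption).
    private final Function<String, String> slowLookup;
    private final AtomicInteger slowCalls = new AtomicInteger();

    CachedRackResolver(Function<String, String> slowLookup) {
        this.slowLookup = slowLookup;
    }

    /** Returns the rack for each host, running the slow lookup only on cache misses. */
    List<String> resolve(List<String> hosts) {
        List<String> racks = new ArrayList<>(hosts.size());
        for (String host : hosts) {
            racks.add(cache.computeIfAbsent(host, h -> {
                slowCalls.incrementAndGet();
                return slowLookup.apply(h);
            }));
        }
        return racks;
    }

    /** Number of times the slow lookup has been invoked so far. */
    int slowCallCount() {
        return slowCalls.get();
    }
}

class PrewarmDemo {
    /** Warms the cache first, then counts slow lookups caused by simulated heartbeats. */
    static int slowCallsDuringHeartbeats(List<String> knownHosts) {
        CachedRackResolver resolver = new CachedRackResolver(h -> "/default-rack");
        resolver.resolve(knownHosts);              // pre-warm before the server starts
        int afterWarmup = resolver.slowCallCount();
        for (String host : knownHosts) {           // one simulated heartbeat per host
            resolver.resolve(Collections.singletonList(host));
        }
        return resolver.slowCallCount() - afterWarmup;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList(
                "tt1.example.com", "tt2.example.com", "tt3.example.com");
        System.out.println("slow lookups during heartbeats: "
                + slowCallsDuringHeartbeats(hosts));
    }
}
```

With the three example hosts, the heartbeat loop triggers no slow lookups
because every host was already cached during the warm-up call. A host that is
missing from the hosts list would still take the slow path on its first
heartbeat.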
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)