I think I have found a bug, but I am not 100% sure. It seems that somewhere
in the code, probably in the Job/TaskTracker, Hadoop always tries to
resolve an IP address from the host name. If so, this is a bug, because the
IP addresses are already known at that stage (the master knows the IP
address of every slave, and every slave knows the master's address from
the configuration file).
In fact, my servers have multiple IP addresses each, and no DNS is set up,
because DNS is not necessary for a rack of machines used for internal
computing purposes.
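
To make the failure mode I suspect concrete, here is a small stand-alone
Java sketch (this is not Hadoop code, and the host name is made up): any
code path that looks a peer up by host name depends on DNS or /etc/hosts,
even when the IP address is already sitting in the configuration file.

import java.net.InetAddress;
import java.net.UnknownHostException;

// Stand-alone illustration, not Hadoop code. On a rack without DNS, any
// lookup that goes through the host name fails, even though the IP address
// is already known from the configuration file.
public class ResolveCheck {
    public static void main(String[] args) {
        String host = (args.length > 0) ? args[0] : "slave-03"; // made-up slave name
        try {
            InetAddress addr = InetAddress.getByName(host);
            System.out.println(host + " -> " + addr.getHostAddress());
        } catch (UnknownHostException e) {
            // This is what happens on our machines: no DNS entry, so the
            // lookup fails even though the daemon has the node's IP address
            // in its configuration.
            System.out.println("cannot resolve " + host);
        }
    }
}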
Even though I have explicitly set the IP address that each master/slave
node should bind to, at some stage the JT/TT still seems to try to resolve
an IP address from the host name. This is a possible cause of HADOOP-1374,
which I have suffered from for the last two weeks.
I suspect this because after we disabled all network interfaces except the
one used for Hadoop and started a DNS server to resolve all the host
names, the problem (HADOOP-1374) disappeared.
Several suggestions:
1. Do not resolve IP addresses from host names; the addresses are already known.
2. There are too many places in the configuration file to set IP
addresses, and I am afraid they are not actually used by Hadoop at all.
A single IP address setting per node should be enough.
3. Use IP addresses instead of host names in logs and reports, or at
least print the IP address after the host name (see the sketch after this
list). In general, the error reporting for network problems is not
precise enough.
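
As a rough sketch of what I mean in point 3 (the helper below is mine, not
something that exists in Hadoop): print the IP address next to the host
name whenever a node is mentioned in a log or error message, and fall back
gracefully when the name cannot be resolved.

import java.net.InetAddress;
import java.net.UnknownHostException;

// Sketch of suggestion 3; hostWithAddress() is a made-up helper, not part of Hadoop.
public class HostReport {
    // Format "hostname (a.b.c.d)" for log messages, so network errors can
    // be traced even when DNS is not set up.
    static String hostWithAddress(String host) {
        try {
            return host + " (" + InetAddress.getByName(host).getHostAddress() + ")";
        } catch (UnknownHostException e) {
            return host + " (unresolvable)";
        }
    }

    public static void main(String[] args) {
        // made-up host name, just to show the output format
        System.out.println("Lost connection to " + hostWithAddress("tracker-01"));
    }
}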
I am not very familiar with the Hadoop code yet, so please correct me if I
am wrong.
Thanks
Yunhong