All:

I have a situation where I have to rely on less-than-stellar hosts files right now. This will be cleaned up in the future. For now, I'd like to verify how task trackers figure out their IP / hostname and communicate it to the JT.
When a task tracker starts, it performs some voodoo to figure out its machine name and IP address, and this is where I think things go south for me. The logic seems to be in o.a.h.mapred.TaskTracker#initialize(). A config variable, mapreduce.tasktracker.host.name, is pulled from the JobConf supplied to the constructor. It seems that this should let one override a hostname and IP guessed wrongly because of a bad hosts file, but nothing I do seems to affect it in a meaningful way; setting it in mapred-site.xml has no effect. I also noticed that TaskTracker uses o.a.h.net.NetUtils, which is a bit strange. There is some notion of a static host map; is this exposed via configuration somewhere? I've also tried setting the TT HTTP listen address explicitly, as well as setting the DNS interface property to its proper value, but nothing seems to work.

The exact problem I'm fighting is too many fetch failures during jobs. It looks like task trackers are trying to fetch mapper outputs from 127.0.0.1:

2010-01-19 21:55:06,791 INFO org.apache.hadoop.mapred.TaskTracker: Starting thread: Map-events fetcher for all reduce tasks on tracker_localhost.localdomain:localhost.localdomain/127.0.0.1:43817
...
2010-01-19 22:06:52,726 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 127.0.0.1:50060, dest: 127.0.0.1:40975, bytes: 0, op: MAPRED_SHUFFLE, cliID: attempt_201001192118_0002_m_000002_0

These log entries seem to indicate that this task tracker selects localhost.localdomain/127.0.0.1 regardless of any settings. The second entry looks like the bad fetch of map output I mentioned. Eventually the job dies with too many fetch failures. With all task trackers removed except for one running on the same machine as the JT, jobs work as expected.

After reading through the code (as best I can) and tracing some of the machine name resolution bits, it seems as if the machine's configured hostname (and the IP it resolves to, by whatever means) is the address advertised by the TT. Is this correct?
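
For anyone who wants to reproduce this on their own nodes, here is a minimal sketch of a standalone check that prints what the JVM itself resolves for the local host. It assumes the TT ends up relying on java.net.InetAddress.getLocalHost() when nothing overrides the hostname, which I have not confirmed from the code. The class name LocalHostCheck and the helper reachableByPeers are made up for illustration and are not part of Hadoop.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.Collections;
import java.util.Enumeration;

// Hypothetical diagnostic, not Hadoop code.
public class LocalHostCheck {

    // An address other nodes could plausibly connect to:
    // neither loopback (127.x.x.x, ::1) nor the wildcard (0.0.0.0, ::).
    public static boolean reachableByPeers(InetAddress addr) {
        return !addr.isLoopbackAddress() && !addr.isAnyLocalAddress();
    }

    public static void main(String[] args)
            throws UnknownHostException, SocketException {
        // The JVM asks the OS for the hostname and resolves it through the
        // system resolver, so a bad hosts file entry shows up here.
        // Assumption: the TT falls back to something equivalent to this.
        InetAddress local = InetAddress.getLocalHost();
        System.out.println("hostname: " + local.getHostName());
        System.out.println("address:  " + local.getHostAddress());
        System.out.println("reachable by peers: " + reachableByPeers(local));

        // For comparison, list every address bound to every interface.
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return; // no interfaces reported
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                System.out.println(nic.getName() + " -> " + addr.getHostAddress()
                        + (reachableByPeers(addr) ? "" : "  (loopback/wildcard)"));
            }
        }
    }
}
```

If the first lines report localhost.localdomain and 127.0.0.1 on a worker node, that would at least be consistent with the log entries above, and it would point at the hosts file rather than at any Hadoop setting.
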
If my reading is wrong, what am I missing? Is there any way to force a TT to advertise a specific hostname (and related IP) regardless of the host's configuration? If not, does anyone else feel like there should be?

I completely understand that the correct answer is to fix the hosts file, or to not depend on it at all and defer to DNS. Still, this bit of the code does seem overly complicated and brittle.

Thoughts?

Thanks.
--
Eric Sammer
e...@lifeless.net
http://esammer.blogspot.com