On 23-apr-2007, at 12:14, Mathijs Homminga wrote:
I have had some trouble with 2 nodes on one of our clusters. While most nodes finished their map tasks successfully in about 2 secs, two were not responding well. On their Task Trackers the task status remained UNASSIGNED for a couple of minutes (and the Job Tracker received no heartbeats), then changed to RUNNING, but in the end the task got killed after 600 secs because no status update had been received. I found out that this was caused by the fact that we had not installed the loopback interface correctly on these two nodes. So, although all machines could connect to each other, two of them could not connect to themselves.
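Since the failure mode is "can reach other hosts, but not itself", a quick self-connectivity probe run on each node may flag the bad ones before a job hangs. The sketch below is not a Hadoop tool. `can_connect_to_self` is a hypothetical helper, and it assumes the daemons reach the local node through its own hostname (or an address you pass in). It resolves that name, opens a listener on the resolved address, and tries to connect to it.

```python
import socket


def can_connect_to_self(host=None, timeout=2.0):
    """Hypothetical helper: True if this machine can open a TCP connection
    to itself through `host` (defaults to this machine's own hostname)."""
    host = host or socket.gethostname()
    try:
        addr = socket.gethostbyname(host)  # IPv4 address the name resolves to
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
            server.bind((addr, 0))  # port 0: let the OS pick a free port
            server.listen(1)
            port = server.getsockname()[1]
            # Connect back to our own listener; a broken loopback setup
            # typically fails in bind() or here with an OSError/timeout.
            with socket.create_connection((addr, port), timeout=timeout):
                return True
    except OSError:  # also covers socket.gaierror and socket.timeout
        return False


if __name__ == "__main__":
    print("loopback 127.0.0.1:", can_connect_to_self("127.0.0.1"))
    print("own hostname:", can_connect_to_self())
```

If `127.0.0.1` passes but the hostname check fails, the hostname probably resolves to an address that is not configured on any local interface.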
Could you explain how you installed your loopback device now? I ran into a similar (maybe the same) problem, where I could only reach the _local_ tasktracker by poking a hole in my firewall.
Btw, could I have seen this in any of the logs?
I don't think so; it just times out.

--
Regards,
Eelco Lempsink
