Rod Taylor wrote:
Overnight the trackers appeared to have difficulty finding a block
from the datanode and eventually exited. The datanode reports having served
the block mentioned below successfully on several earlier occasions.

Were all of the datanodes still alive? If so, then the problem is probably that heartbeats from the datanodes to the namenode were delayed, so that the namenode assumed the datanodes were dead. This can sometimes happen when the namenode becomes severely loaded. What is in the namenode's logs around this time? Does it report some datanodes as lost?

Instead of the tracker exiting, I would have expected the current task to
be aborted and the tracker to continue taking on others. Below are the
logs from the first problem with the block through to where the tracker
was no longer running.
[ ... ]
Exception in thread "main" java.util.ConcurrentModificationException
        at java.util.TreeMap$EntryIterator.nextEntry(TreeMap.java:1048)
        at java.util.TreeMap$ValueIterator.next(TreeMap.java:1079)
        at org.apache.nutch.mapred.TaskTracker.close(TaskTracker.java:130)
        at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:281)
        at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:625)

I have seen this before. It is a bug in the TaskTracker that I have not yet had time to fix. It is triggered by an unresponsive jobtracker, which times out the tasktracker, assuming it is dead. Then, when the tasktracker's heartbeat arrives, the jobtracker does not recognize it, which causes the tasktracker to close and restart. The bug is in the tasktracker's close method, which removes entries from the task collection while iterating over it, an unsafe use of an iterator.

It should instead use something like:

  while (tasks.size() != 0) {
    TaskInProgress tip = (TaskInProgress)tasks.first();
    tip.jobHasFinished();  // removes the tip from tasks, so the loop terminates
  }
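To make the failure mode concrete, here is a minimal standalone sketch (hypothetical class and field names, not the actual TaskTracker code) of why the original close method throws: structurally modifying a TreeMap while one of its iterators is open fails fast with ConcurrentModificationException, whereas draining the map one entry at a time without holding an iterator, as in the snippet above, does not.

```java
import java.util.ConcurrentModificationException;
import java.util.TreeMap;

public class CloseDemo {
    public static void main(String[] args) {
        // Stand-in for the tasktracker's task map: taskid -> task description.
        TreeMap<String, String> tasks = new TreeMap<>();
        tasks.put("task_0001", "map");
        tasks.put("task_0002", "reduce");

        // Unsafe: removing entries while a for-each iterator is open.
        // The iterator detects the structural modification on its next
        // call to next() and throws ConcurrentModificationException.
        boolean threw = false;
        try {
            for (String id : tasks.keySet()) {
                tasks.remove(id);  // structural modification mid-iteration
            }
        } catch (ConcurrentModificationException e) {
            threw = true;
        }
        System.out.println("unsafe iteration threw: " + threw);

        // Safe: no iterator is held across the removal; each pass takes
        // the first remaining entry and removes it (standing in for
        // tip.jobHasFinished() removing the tip from tasks).
        while (tasks.size() != 0) {
            String id = tasks.firstKey();
            tasks.remove(id);
        }
        System.out.println("tasks remaining: " + tasks.size());
    }
}
```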

What's in the jobtracker logs around this time? Did it report this tasktracker as lost?

Was this all by chance at the start of an Indexer job? That's when I've seen this sort of thing. This job has more input files than any other, and my guess is that constructing the splits (which ties up the jobtracker) can for some reason take longer than the timeout, although it really shouldn't...

Doug

