Rod Taylor wrote:
Overnight the trackers appeared to have difficulty finding a block
from the datanode and eventually exited. The datanode reports having served
the block mentioned below successfully on several earlier occasions.

Were all of the datanodes still alive? If so, then the problem is probably that heartbeats from the datanodes to the namenode were delayed, so that the namenode assumed the datanodes were dead. This can sometimes happen when the namenode becomes severely loaded. What is in the namenode's logs around this time? Does it report some datanodes as lost?

Instead of the tracker exiting, I would have expected the current task to
be aborted and the tracker to continue taking on others. Below are the
logs from the first problem with the block through to where the tracker
was no longer running.
[ ... ]
Exception in thread "main" java.util.ConcurrentModificationException
        at java.util.TreeMap$EntryIterator.nextEntry(TreeMap.java:1048)
        at java.util.TreeMap$ValueIterator.next(TreeMap.java:1079)
        at org.apache.nutch.mapred.TaskTracker.close(TaskTracker.java:130)
        at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:281)
        at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:625)

I have seen this before. It is a bug in the TaskTracker that I have not yet had time to fix. It is triggered by an unresponsive jobtracker, which times out the tasktracker, assuming it is dead. Then, when the tasktracker's heartbeat arrives, the jobtracker does not recognize it, which causes the tasktracker to close and restart. The bug is in the tasktracker's close method, which removes entries from the task collection while iterating over it, an unsafe use of an iterator.

It should instead use something like:

  while (tasks.size() != 0) {
    TaskInProgress tip = (TaskInProgress)tasks.first();
    tip.jobHasFinished();  // removes the tip from tasks, so the loop terminates
  }
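To make the failure mode concrete, here is a minimal standalone sketch (hypothetical class and field names, not the actual TaskTracker code) of why the original close method throws: structurally modifying a TreeMap while one of its iterators is open fails fast with ConcurrentModificationException, whereas draining the map one entry at a time without holding an iterator, as in the snippet above, does not.

```java
import java.util.ConcurrentModificationException;
import java.util.TreeMap;

public class CloseDemo {
    public static void main(String[] args) {
        // Stand-in for the tasktracker's task map: taskid -> task description.
        TreeMap<String, String> tasks = new TreeMap<>();
        tasks.put("task_0001", "map");
        tasks.put("task_0002", "reduce");

        // Unsafe: removing entries while a for-each iterator is open.
        // The iterator detects the structural modification on its next
        // call to next() and throws ConcurrentModificationException.
        boolean threw = false;
        try {
            for (String id : tasks.keySet()) {
                tasks.remove(id);  // structural modification mid-iteration
            }
        } catch (ConcurrentModificationException e) {
            threw = true;
        }
        System.out.println("unsafe iteration threw: " + threw);

        // Safe: no iterator is held across the removal; each pass takes
        // the first remaining entry and removes it (standing in for
        // tip.jobHasFinished() removing the tip from tasks).
        while (tasks.size() != 0) {
            String id = tasks.firstKey();
            tasks.remove(id);
        }
        System.out.println("tasks remaining: " + tasks.size());
    }
}
```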

What's in the jobtracker logs around this time? Did it report this tasktracker as lost?

Was this all by chance at the start of an Indexer job? That's when I've seen this sort of thing. This job has more input files than any other, and my guess is that constructing the splits (which ties up the jobtracker) can for some reason take longer than the timeout, although it really shouldn't...

Doug

