Rod Taylor wrote:
Overnight the trackers appeared to have difficulty finding a block
from the datanode and eventually exited. The datanode reports serving
the block mentioned below successfully on several earlier occasions.
Were all of the datanodes still alive? If so, then the problem is
probably that heartbeats from the datanodes to the namenode were
delayed, so that the namenode assumed the datanodes were dead. This can
sometimes happen when the namenode becomes severely loaded. What is in
the namenode's logs around this time? Does it report some datanodes as
lost?
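To make the failure mode concrete, here is a minimal sketch of heartbeat-based liveness tracking, assuming a fixed expiry interval. The names (HeartbeatMonitor, recordHeartbeat, lostNodes) and the 10-second timeout are hypothetical and chosen for illustration; this is not the actual namenode code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the actual namenode implementation.
public class HeartbeatMonitor {
    private final long expiryMillis;
    // Node name -> time (ms) at which its last heartbeat was processed.
    private final Map<String, Long> lastSeen = new HashMap<String, Long>();

    public HeartbeatMonitor(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    // Called when the master gets around to processing a heartbeat.
    public void recordHeartbeat(String node, long nowMillis) {
        lastSeen.put(node, nowMillis);
    }

    // Nodes whose last processed heartbeat is older than the expiry interval.
    public List<String> lostNodes(long nowMillis) {
        List<String> lost = new ArrayList<String>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > expiryMillis) {
                lost.add(e.getKey());
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(10000L);
        monitor.recordHeartbeat("datanode-1", 0L);
        // Checked within the expiry interval: the node counts as alive.
        System.out.println(monitor.lostNodes(5000L));
        // The master was too busy to process any heartbeat for 15 s, so the
        // node is reported lost even though it never went down.
        System.out.println(monitor.lostNodes(15000L));
    }
}
```

The point of the sketch is that the master only knows when it last processed a heartbeat, not when the node sent it, so a heavily loaded master can declare healthy nodes dead.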
Instead of the tracker exiting, I would have expected it to abort the
current task and continue taking others. Below are the logs
from the problem with the block through to where the tracker was no
longer running.
[ ... ]
Exception in thread "main" java.util.ConcurrentModificationException
        at java.util.TreeMap$EntryIterator.nextEntry(TreeMap.java:1048)
        at java.util.TreeMap$ValueIterator.next(TreeMap.java:1079)
        at org.apache.nutch.mapred.TaskTracker.close(TaskTracker.java:130)
        at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:281)
        at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:625)
I have seen this before. It is a bug in the TaskTracker that I have not
yet had time to fix. It is triggered by an unresponsive jobtracker,
which times out the tasktracker, assuming it is dead. Then, when the
tasktracker's heartbeat arrives, the jobtracker does not recognize it,
which causes the tasktracker to close and restart. The bug is in the
tasktracker's close method, which uses an iterator in an unsafe way.
It should instead use something like:
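  while (tasks.size() != 0) {
    // tasks is a TreeMap (see the trace), which has no first() method
    TaskInProgress tip = (TaskInProgress)tasks.get(tasks.firstKey());
    // jobHasFinished() must remove tip from tasks, or this never ends
    tip.jobHasFinished();
  }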
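For anyone who wants to reproduce the problem in isolation, here is a small self-contained sketch. It assumes that jobHasFinished() removes the task from the map, which is what the stack trace suggests. CloseDemo and Task are hypothetical stand-ins, not Nutch classes, and the safe loop uses tasks.get(tasks.firstKey()) because TreeMap has no first() method.

```java
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.TreeMap;

// Hypothetical stand-in for the tasktracker's task table; not Nutch code.
public class CloseDemo {
    static class Task {
        private final String id;
        private final TreeMap<String, Task> owner;

        Task(String id, TreeMap<String, Task> owner) {
            this.id = id;
            this.owner = owner;
        }

        // Assumption: finishing a task removes it from the owning map.
        void jobHasFinished() {
            owner.remove(id);
        }
    }

    static TreeMap<String, Task> newTasks() {
        TreeMap<String, Task> tasks = new TreeMap<String, Task>();
        for (int i = 0; i < 3; i++) {
            String id = "task_" + i;
            tasks.put(id, new Task(id, tasks));
        }
        return tasks;
    }

    // Unsafe: the map is modified while an iterator over it is still in use.
    static boolean unsafeCloseThrows() {
        TreeMap<String, Task> tasks = newTasks();
        try {
            for (Iterator<Task> it = tasks.values().iterator(); it.hasNext();) {
                it.next().jobHasFinished();
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Safe: no iterator is held across the removal.
    static int safeCloseRemaining() {
        TreeMap<String, Task> tasks = newTasks();
        while (tasks.size() != 0) {
            Task tip = tasks.get(tasks.firstKey());
            tip.jobHasFinished();
        }
        return tasks.size();
    }

    public static void main(String[] args) {
        System.out.println("unsafe close throws: " + unsafeCloseThrows());
        System.out.println("tasks left after safe close: " + safeCloseRemaining());
    }
}
```

The unsafe variant fails on the second call to next(), because TreeMap's iterators are fail-fast and detect the removal made through the map itself. The drain loop avoids this by re-reading the first key on every pass.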
What's in the jobtracker logs around this time? Did it report this
tasktracker as lost?
Was this, by any chance, at the start of an Indexer job? That's when I've
seen this sort of thing. This job has more input files than any other,
and my guess is that constructing the splits (which ties up the
jobtracker) can for some reason take longer than the timeout, although
it really shouldn't...
Doug