On Mon, 2005-10-17 at 11:06 -0700, Doug Cutting wrote:
> Rod Taylor wrote:
> > Overnight the trackers appeared to have difficulties finding a block
> > from the datanode and eventually exited. The datanode reports serving
> > the block mentioned below successfully on several earlier occasions.
>
> Were all of the datanodes still alive? If so, then the problem is
> probably that heartbeats from the datanodes to the namenode were
> delayed, so that the namenode assumed the datanodes were dead. This can
> sometimes happen when the namenode becomes severely loaded.
Yes. All of the datanodes, the namenode, and the jobtracker were alive.
I don't recall any indication of errors in their logs, but the logs have
since been rotated and I no longer have them.
The machine the namenode runs on does have very high load at times. Do
you recommend a separate box for the namenode and jobtracker that runs
only those two daemons?
> What is in the namenode's logs around this time? Does it report some
> datanodes as lost?
Sorry, I don't have the logs anymore. I will be sure to check for it
next time the problem occurs.
> > Instead of the tracker exiting I would have expected the current task to
> > be aborted and for it to continue on taking others. Below are the logs
> > from the problem with the block through to where the tracker was no
> > longer running.
> [ ... ]
> > Exception in thread "main" java.util.ConcurrentModificationException
> > at java.util.TreeMap$EntryIterator.nextEntry(TreeMap.java:1048)
> > at java.util.TreeMap$ValueIterator.next(TreeMap.java:1079)
> > at
> > org.apache.nutch.mapred.TaskTracker.close(TaskTracker.java:130)
> > at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:281)
> > at
> > org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:625)
>
> I have seen this before. It is a bug in the TaskTracker that I have not
> yet had time to fix. It is triggered by an unresponsive jobtracker,
> which times out the tasktracker, assuming it is dead. Then, when the
> tasktracker's heartbeat arrives, the jobtracker does not recognize it,
> which causes the tasktracker to close and restart. The bug is in the
> tasktracker's close method, which uses an iterator in an unsafe way.
>
> It should instead use something like:
>
> while (tasks.size() != 0) {
> TaskInProgress tip = (TaskInProgress)tasks.first();
> tip.jobHasFinished();
> }
>
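A minimal, self-contained sketch of the failure mode and of the drain
loop suggested above. It assumes that jobHasFinished() removes the task
from the tasks map, which the stack trace suggests but the thread does
not confirm. CloseSketch, Task, fill(), unsafeClose() and safeClose()
are hypothetical names, not the actual TaskTracker code. TreeMap has no
first() method, so the sketch uses firstEntry().getValue() instead.

```java
import java.util.ConcurrentModificationException;
import java.util.TreeMap;

public class CloseSketch {
    // Hypothetical stand-in for the tasktracker's table of running tasks.
    static final TreeMap<String, Task> tasks = new TreeMap<>();

    static class Task {
        final String id;
        Task(String id) { this.id = id; }

        // Assumption: finishing a task removes it from the shared map.
        void jobHasFinished() { tasks.remove(id); }
    }

    // Reset the map to three tasks so each strategy starts from the same state.
    static void fill() {
        tasks.clear();
        for (String id : new String[] {"task_1", "task_2", "task_3"}) {
            tasks.put(id, new Task(id));
        }
    }

    // Unsafe: the map is modified while its values() iterator is in use.
    // Returns true if the iterator detected the modification.
    static boolean unsafeClose() {
        try {
            for (Task t : tasks.values()) {
                t.jobHasFinished();
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Safe: no iterator is held across the removal. Each pass looks up the
    // current first entry, so the loop ends once the map is empty.
    static void safeClose() {
        while (!tasks.isEmpty()) {
            Task t = tasks.firstEntry().getValue();
            t.jobHasFinished();
        }
    }

    public static void main(String[] args) {
        fill();
        System.out.println("unsafe close hit CME: " + unsafeClose());
        fill();
        safeClose();
        System.out.println("tasks left after safe close: " + tasks.size());
    }
}
```

The drain loop only terminates if every call really does remove its
entry. If jobHasFinished() could leave the task in the map, the loop
would need to remove the entry itself.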
> What's in the jobtracker logs around this time? Did it report this
> tasktracker as lost?
The jobtracker did not indicate such a thing (via an exception, anyway).
Tasktracker connections do seem to be established and dropped fairly
frequently. Perhaps this is what you mean?
> Was this all by chance at the start of an Indexer job? That's when I've
> seen this sort of thing. This job has more input files than any other,
It was in the middle of a fetch, actually.
--
Rod Taylor <[EMAIL PROTECTED]>