On Mon, 2005-10-17 at 11:06 -0700, Doug Cutting wrote:
> Rod Taylor wrote:
> > Overnight the trackers appeared to have difficulties finding a block
> > from the datanode and eventually exited. The datanode reports serving
> > the block mentioned below successfully on several earlier occasions.
> 
> Were all of the datanodes still alive?  If so, then the problem is 
> probably that heartbeats from the datanodes to the namenode were 
> delayed, so that the namenode assumed the datanodes were dead.  This can 

Yes. All of the datanodes, the namenode and the jobtracker were alive. I
don't recall any errors in their logs, but the logs have since been
rotated and I no longer have them.

The machine the namenode is running on does have very high load at
times. Do you recommend a dedicated box that runs only the namenode and
jobtracker?

> sometimes happen when the namenode becomes severely loaded.  What is in 
> the namenode's logs around this time?  Does it report some datanodes as 
> lost?

Sorry, I don't have the logs anymore. I will be sure to check for it
next time the problem occurs.
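
To check that I follow the failure mode described above, here is a
minimal sketch of heartbeat-based liveness tracking. It is not the Nutch
namenode code: the class name, the method names and the 10 second expiry
are all assumptions made for illustration. It only shows how a node that
is still running can be reported lost when its heartbeat arrives late.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical liveness table; not the Nutch namenode implementation.
class HeartbeatSketch {
    // Assumed timeout: a node with no heartbeat for longer than this is "lost".
    final long expiryMillis;
    // Node name -> time (in milliseconds) of its most recent heartbeat.
    final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatSketch(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    // Record a heartbeat from a node at the given time.
    void heartbeat(String node, long nowMillis) {
        lastSeen.put(node, nowMillis);
    }

    // Nodes whose last heartbeat is older than the expiry window.
    List<String> lostNodes(long nowMillis) {
        List<String> lost = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > expiryMillis) {
                lost.add(e.getKey());
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        HeartbeatSketch table = new HeartbeatSketch(10_000);
        table.heartbeat("datanode-a", 0);
        table.heartbeat("datanode-b", 0);
        // datanode-a reports again; datanode-b is alive but its heartbeat is delayed.
        table.heartbeat("datanode-a", 9_000);
        System.out.println(table.lostNodes(12_000)); // prints [datanode-b]
    }
}
```

If that is roughly what happens, a heavily loaded namenode that processes
heartbeats late would look the same as a dead datanode from the table's
point of view.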

> > Instead of the tracker exiting I would have expected the current task to
> > be aborted and for it to continue on taking others. Below are the logs
> > from the problem with the block through to where the tracker was no
> > longer running.
> [ ... ]
> > Exception in thread "main" java.util.ConcurrentModificationException
> >         at java.util.TreeMap$EntryIterator.nextEntry(TreeMap.java:1048)
> >         at java.util.TreeMap$ValueIterator.next(TreeMap.java:1079)
> >         at
> > org.apache.nutch.mapred.TaskTracker.close(TaskTracker.java:130)
> >         at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:281)
> >         at
> > org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:625)
> 
> I have seen this before.  It is a bug in the TaskTracker that I have not 
> yet had time to fix.  It is triggered by an unresponsive jobtracker, 
> which times out the tasktracker, assuming it is dead.  Then, when the 
> tasktracker's heartbeat arrives, the jobtracker does not recognize it, 
> which causes the tasktracker to close and restart.  The bug is in the 
> tasktracker's close method, which uses an iterator in an unsafe way.
> 
> It should instead use something like:
> 
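>    // Drain the map without holding an iterator; assumes
>    // jobHasFinished() removes the task from tasks (a TreeMap).
>    while (!tasks.isEmpty()) {
>      TaskInProgress tip = (TaskInProgress)tasks.get(tasks.firstKey());
>      tip.jobHasFinished();
>    }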
> 
> What's in the jobtracker logs around this time?  Did it report this 
> tasktracker as lost?

The jobtracker did not indicate such a thing (via an exception anyway).
Tasktracker connections seem to be established and dropped fairly
frequently. Perhaps this is what you mean?
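
On the close() bug itself, a minimal self-contained sketch of the
exception in the trace and of the drain loop suggested above may be
useful to others hitting this. It is not the Nutch source. The class
TrackerCloseSketch, its Task type and the helper methods are hypothetical
stand-ins, and it assumes that jobHasFinished() removes the task from the
tracker's TreeMap, which is what would invalidate a live iterator.

```java
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.TreeMap;

// Hypothetical stand-in for the tasktracker; not the Nutch source.
class TrackerCloseSketch {
    // Task id -> task, kept in a TreeMap as the stack trace suggests.
    final TreeMap<String, Task> tasks = new TreeMap<>();

    class Task {
        final String id;

        Task(String id) {
            this.id = id;
        }

        // Assumption: finishing a task removes it from the tracker's map.
        void jobHasFinished() {
            tasks.remove(id);
        }
    }

    void add(String id) {
        tasks.put(id, new Task(id));
    }

    // Unsafe: the map is modified while an iterator over it is still in use.
    void closeUnsafe() {
        for (Iterator<Task> it = tasks.values().iterator(); it.hasNext();) {
            it.next().jobHasFinished();
        }
    }

    // Safe: look up the first entry afresh on each pass, so no iterator
    // survives a removal.
    void closeSafe() {
        while (!tasks.isEmpty()) {
            tasks.get(tasks.firstKey()).jobHasFinished();
        }
    }

    // Build a tracker holding n tasks named task0, task1, ...
    static TrackerCloseSketch withTasks(int n) {
        TrackerCloseSketch t = new TrackerCloseSketch();
        for (int i = 0; i < n; i++) {
            t.add("task" + i);
        }
        return t;
    }

    // True if the unsafe close hits the fail-fast iterator check.
    static boolean unsafeCloseFails() {
        TrackerCloseSketch t = withTasks(3);
        try {
            t.closeUnsafe();
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("unsafe close throws: " + unsafeCloseFails());
        TrackerCloseSketch t = withTasks(3);
        t.closeSafe();
        System.out.println("tasks left after safe close: " + t.tasks.size());
    }
}
```

Note that the unsafe loop only fails when at least two tasks are present;
with a single task the iterator is exhausted before the modification is
detected, which may be why the problem does not show up on every restart.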

> Was this all by chance at the start of an Indexer job?  That's when I've 
> seen this sort of thing.  This job has more input files than any other, 

Middle of a fetch, actually.

-- 
Rod Taylor <[EMAIL PROTECTED]>
