Gang,
Chris Schneider wrote:
I was unable to even get the indexing phase started; I would get a
timeout right at the beginning. I tried increasing the
ipc.client.timeout from 5 minutes to 10 minutes, but that didn't
help. In desperation, I increased it to 30 minutes and went to walk
the dogs. As it turned out, it apparently took 14 minutes to
"compute the splits". The job is still running (34% complete).
Thus, it does seem like Doug was right about this being the problem.
I have no idea why this takes so long. We should profile this
operation to figure out what's going on, because it shouldn't take
anywhere near that long. It should be easy to write a simple
program that constructs a JobConf and InputFormat like those used in
this job, and calls getSplits(). Then profile this as a standalone
program to see where the time is going. You probably don't really
want to profile something that takes 14 minutes, so perhaps profile
it on a subset of the input.
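For what it's worth, here's a rough sketch of what such a standalone
timing harness might look like. This is only an illustration against
the org.apache.hadoop.mapred classes, assuming SequenceFile inputs;
the class and method names have been moving around between releases,
so adjust the InputFormat, package names, and paths to match whatever
the indexing job actually uses.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class SplitTimer {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SplitTimer.class);
    // Point at the same directories the indexing job reads
    // (crawldb, linkdb, segment subdirs), passed on the command line.
    for (int i = 0; i < args.length; i++) {
      FileInputFormat.addInputPath(conf, new Path(args[i]));
    }
    SequenceFileInputFormat format = new SequenceFileInputFormat();
    long start = System.currentTimeMillis();
    // This is the call that appears to be taking ~14 minutes.
    InputSplit[] splits = format.getSplits(conf, conf.getNumMapTasks());
    long elapsed = System.currentTimeMillis() - start;
    System.out.println(splits.length + " splits computed in " + elapsed + " ms");
  }
}

Run it against the same input directories the indexing job is given,
confirm it reproduces the delay, and then attach a profiler (or just
shrink the input until the run is short enough to profile comfortably).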
Everyone should probably be made aware of the strange behavior we see
during indexing, at least for a relatively large number of large
segments (topN=500K, depth=20) with a relatively large crawldb (50M
URLs). Note that this was all performed with ipc.client.timeout set
to 30 minutes.
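For anyone who wants to reproduce this, the override went into
conf/nutch-site.xml; assuming the value is in milliseconds like the
other ipc.* settings, 30 minutes looks like:

  <property>
    <name>ipc.client.timeout</name>
    <value>1800000</value> <!-- 30 minutes -->
  </property>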
After launching the indexing job, the web UI shows all of the
TaskTrackers, but the numbers in the "Secs since heartbeat" column
just keep increasing. This goes on for about 10 minutes until the
JobTracker finally loses all of them (and the tasks they were working
on), as is shown in its log:
060210 224115 parsing file:/home/crawler/nutch/conf/nutch-site.xml
060210 225151 Lost tracker 'tracker_37064'
060210 225151 Task 'task_m_4ftk58' has been lost.
060210 225151 Task 'task_m_6ww2ri' has been lost.
...(snip)...
060210 225151 Task 'task_r_y6d190' has been lost.
060210 225151 Lost tracker 'tracker_92921'
060210 225151 Task 'task_m_9p24at' has been lost.
...(etc)...
At this point, the web UI is still up, the job shows 0% complete, and
the TaskTrackers table is empty. It goes on like this for an hour or
so, during which any rational person would probably want to kill
the job and start over.
Don't do this! Keep the faith!!!
About an hour later, the JobTracker magically reestablishes its
connection to the TaskTrackers (which now have new names), as is
shown in its log:
060210 225151 Task 'task_r_yj3y3o' has been lost.
060210 235403 Adding task 'task_m_k9u9a8' to set for tracker 'tracker_85874'
060210 235404 Adding task 'task_m_pijt4q' to set for tracker 'tracker_61888'
...(etc)...
The web UI also shows that the TaskTrackers are back (with their new names).
There's nothing in the TaskTracker logs during the initial 10
minutes, then a bunch of exiting and closing messages, until finally
the TaskTrackers start "Reinitializing local state":
060210 225403 Stopping server on 50050
060210 230102 Server handler 4 on 50050: exiting
...(snip)...
060210 230105 Server handler 7 on 50050: exiting
060210 232024 Server listener on port 50050: exiting
060210 232403 Stopping server on 50040
060210 234902 Server listener on port 50040: exiting
060210 234925 Server connection on port 50040 from 192.168.1.5: exiting
...(snip)...
060210 235009 Server connection on port 50040 from 192.168.1.10: exiting
060210 235013 Client connection to 192.168.1.4:50040: closing
060210 235014 Client connection to 192.168.1.7:50040: closing
060210 235015 Server connection on port 50040 from 192.168.1.7: exiting
060210 235016 Server handler 0 on 50040: exiting
...(snip)...
060210 235024 Server handler 2 on 50040: exiting
060210 235403 Reinitializing local state
060210 235403 Server listener on port 50050: starting
060210 235403 Server handler 0 on 50050: starting
...(etc)...
During the time that the TaskTrackers are lost, neither the master
nor the slave machines seem to be using much of the CPU or RAM, and
the DataNode logs are quiet. I suppose that it's probably I/O bound
on the master machine, but even that seems mysterious to me. It would
seem particularly inappropriate for the JobTracker to punt the
TaskTrackers because the master was too busy to listen for their
heartbeats.
At any rate, once the TaskTrackers go through the "Reinitializing
local state" thing, the indexing job seems to proceed normally, and
it eventually completes with no errors.
FYI,
- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------