Gang,
Chris Schneider wrote:
I was unable to even get the indexing phase started; I would get a
timeout right at the beginning. I tried increasing the
ipc.client.timeout from 5 minutes to 10 minutes, but that didn't
help. In desperation, I increased it to 30 minutes and went to walk
the dogs. As it turned out, it apparently took 14 minutes to
"compute the splits". The job is still running (34% complete).
Thus, it does seem like Doug was right about this being the problem.
I have no idea why this takes so long. We should profile this
operation to figure out what's going on, because it shouldn't take
anywhere near that long. It should be easy to write a simple
program that constructs a JobConf and InputFormat like those used in
this job, and calls getSplits(). Then profile this as a standalone
program to see where the time is going. You probably don't really
want to profile something that takes 14 minutes, so perhaps profile
it on a subset of the input.
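For what it's worth, here's a rough sketch of what such a standalone
timing harness might look like. This is only an illustration against
the org.apache.hadoop.mapred classes, assuming SequenceFile inputs;
the class and method names have been moving around between releases,
so adjust the InputFormat, package names, and paths to match whatever
the indexing job actually uses.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class SplitTimer {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SplitTimer.class);
    // Point at the same directories the indexing job reads
    // (crawldb, linkdb, segment subdirs), passed on the command line.
    for (int i = 0; i < args.length; i++) {
      FileInputFormat.addInputPath(conf, new Path(args[i]));
    }
    SequenceFileInputFormat format = new SequenceFileInputFormat();
    long start = System.currentTimeMillis();
    // This is the call that appears to be taking ~14 minutes.
    InputSplit[] splits = format.getSplits(conf, conf.getNumMapTasks());
    long elapsed = System.currentTimeMillis() - start;
    System.out.println(splits.length + " splits computed in " + elapsed + " ms");
  }
}

Run it against the same input directories the indexing job is given,
confirm it reproduces the delay, and then attach a profiler (or just
shrink the input until the run is short enough to profile comfortably).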
Everyone should probably be made aware of the strange behavior we see
during indexing, at least for a relatively large number of large
segments (topN=500K, depth=20) with a relatively large crawldb (50M
URLs). Note that this was all performed with ipc.client.timeout set
to 30 minutes.
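For anyone who wants to reproduce this, the override went into
conf/nutch-site.xml; assuming the value is in milliseconds like the
other ipc.* settings, 30 minutes looks like:

  <property>
    <name>ipc.client.timeout</name>
    <value>1800000</value> <!-- 30 minutes -->
  </property>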
After launching the indexing job, the web UI shows all of the
TaskTrackers, but the numbers in the "Secs since heartbeat" column
just keep increasing. This goes on for about 10 minutes until the
JobTracker finally loses all of them (and the tasks they were working
on), as is shown in its log:
060210 224115 parsing file:/home/crawler/nutch/conf/nutch-site.xml
060210 225151 Lost tracker 'tracker_37064'
060210 225151 Task 'task_m_4ftk58' has been lost.
060210 225151 Task 'task_m_6ww2ri' has been lost.
...(snip)...
060210 225151 Task 'task_r_y6d190' has been lost.
060210 225151 Lost tracker 'tracker_92921'
060210 225151 Task 'task_m_9p24at' has been lost.
...(etc)...
At this point, the web UI is still up, the job shows 0% complete, and
the TaskTrackers table is empty. It goes on like this for an hour or
so, during which any rational person would probably want to kill
the job and start over.
Don't do this! Keep the faith!!!
About an hour later, the JobTracker magically reestablishes its
connection to the TaskTrackers (which now have new names), as is
shown in its log:
060210 225151 Task 'task_r_yj3y3o' has been lost.
060210 235403 Adding task 'task_m_k9u9a8' to set for tracker 'tracker_85874'
060210 235404 Adding task 'task_m_pijt4q' to set for tracker 'tracker_61888'
...(etc)...
The web UI also shows that the TaskTrackers are back (with their new names).
There's nothing in the TaskTracker logs during the initial 10
minutes, then a bunch of exiting and closing messages, until finally
the TaskTrackers start "Reinitializing local state":
060210 225403 Stopping server on 50050
060210 230102 Server handler 4 on 50050: exiting
...(snip)...
060210 230105 Server handler 7 on 50050: exiting
060210 232024 Server listener on port 50050: exiting
060210 232403 Stopping server on 50040
060210 234902 Server listener on port 50040: exiting
060210 234925 Server connection on port 50040 from 192.168.1.5: exiting
...(snip)...
060210 235009 Server connection on port 50040 from 192.168.1.10: exiting
060210 235013 Client connection to 192.168.1.4:50040: closing
060210 235014 Client connection to 192.168.1.7:50040: closing
060210 235015 Server connection on port 50040 from 192.168.1.7: exiting
060210 235016 Server handler 0 on 50040: exiting
...(snip)...
060210 235024 Server handler 2 on 50040: exiting
060210 235403 Reinitializing local state
060210 235403 Server listener on port 50050: starting
060210 235403 Server handler 0 on 50050: starting
...(etc)...
During the time that the TaskTrackers are lost, neither the master
nor the slave machines seem to be using much of the CPU or RAM, and
the DataNode logs are quiet. I suppose that it's probably I/O bound
on the master machine, but even that seems mysterious to me. It would
seem particularly inappropriate for the JobTracker to punt the
TaskTrackers because the master was too busy to listen for their
heartbeats.
At any rate, once the TaskTrackers go through the "Reinitializing
local state" thing, the indexing job seems to proceed normally, and
it eventually completes with no errors.
FYI,
- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------