Ken Krugler wrote:

> Hello fellow Nutchers,
>
> I followed the steps described here by Doug:
>  
> <http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/[EMAIL
>  PROTECTED]>
>
>
> ...to start a test run of the new (0.8, as of 1/12/2006) version of
> Nutch.
>
> It ran for quite a while on my three machines - started at 111226, and
> died at 150937, so almost four hours.
>
> The error occurred during the Indexer phase:
>
> 060114 150937 Indexer: starting
> 060114 150937 Indexer: linkdb: crawl-20060114111226/linkdb
> 060114 150937 Indexer: adding segment:
> /user/crawler/crawl-20060114111226/segments/20060114111918
> 060114 150937 parsing file:/home/crawler/nutch/conf/nutch-default.xml
> 060114 150937 parsing file:/home/crawler/nutch/conf/crawl-tool.xml
> 060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
> 060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
> 060114 150937 parsing file:/home/crawler/nutch/conf/nutch-site.xml
> 060114 150937 Indexer: adding segment:
> /user/crawler/crawl-20060114111226/segments/20060114122751
> 060114 150937 Indexer: adding segment:
> /user/crawler/crawl-20060114111226/segments/20060114133620
> Exception in thread "main" java.io.IOException: timed out waiting for
> response
>         at org.apache.nutch.ipc.Client.call(Client.java:296)
>         at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
>         at $Proxy1.submitJob(Unknown Source)
>         at
> org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
>         at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
>         at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
>
> 1. Any ideas what might have caused it to time out just now, when it
> had successfully run many jobs up to that point?
>
> 2. What cruft might I need to get rid of because it died? For example,
> I see a reference to /home/crawler/tmp/local/jobTracker/job_18cunz.xml
> now when I try to execute some Nutch commands.

I've had the same problem during the invertlinks step when dealing with
a large number of URLs. Increasing the ipc.client.timeout value from
60000 to 100000 (cf. nutch-default.xml) did the trick.
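
A minimal sketch of that override, assuming you put it in
conf/nutch-site.xml (which your log shows being parsed after
nutch-default.xml) rather than editing nutch-default.xml itself. Only
the property element is shown; it belongs inside the file's existing
root element, and the unit is assumed to be milliseconds:

```xml
<!-- Goes inside the existing root element of conf/nutch-site.xml. -->
<property>
  <name>ipc.client.timeout</name>
  <!-- Raised from the 60000 default; value assumed to be in milliseconds. -->
  <value>100000</value>
</property>
```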

>
> 3. What's the best way to find out how many pages were actually
> crawled, how many links are in the DB, etc? The 0.7-era commands
> (readdb, segread, etc) don't seem to be working with the new NDFS setup.

The following gives you some stats about the crawl db (the number of
URLs fetched, unfetched, and "dead"), where crawldb is the path to
your crawl db:

  nutch readdb crawldb -stats

>
> 4. Any idea whether 4 hours is a reasonable amount of time for this
> test? It seemed long to me, given that I was starting with a single
> URL as the seed.
>
How many crawl passes did you do?

--Flo
