Hello fellow Nutchers,

I followed the steps described here by Doug:
 <http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/[EMAIL PROTECTED]>

...to start a test run of the new (0.8, as of 1/12/2006) version of Nutch.

It ran for quite a while on my three machines: it started at 111226 and died at 150937 (log timestamps, HHmmss), so almost four hours.
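
For anyone who wants the exact figure, here is a minimal Python sketch of the arithmetic. It assumes the log stamps are yyMMdd HHmmss, and it takes the start date (060114) from the crawl directory name; the helper name `elapsed` is only illustrative and has nothing to do with Nutch itself.

```python
from datetime import datetime, timedelta

# Assumption: Nutch log stamps look like "060114 150937" (yyMMdd HHmmss).
LOG_TIME_FORMAT = "%y%m%d %H%M%S"


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two log stamps (hypothetical helper)."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    ended = datetime.strptime(end, LOG_TIME_FORMAT)
    return ended - started


# Start stamp from the crawl directory name, end stamp from the Indexer log.
run_time = elapsed("060114 111226", "060114 150937")
print(run_time)  # 3:57:11
```

So the run lasted just under four hours before the Indexer job submission timed out.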

The error occurred during the Indexer phase:

060114 150937 Indexer: starting
060114 150937 Indexer: linkdb: crawl-20060114111226/linkdb
060114 150937 Indexer: adding segment: /user/crawler/crawl-20060114111226/segments/20060114111918
060114 150937 parsing file:/home/crawler/nutch/conf/nutch-default.xml
060114 150937 parsing file:/home/crawler/nutch/conf/crawl-tool.xml
060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
060114 150937 parsing file:/home/crawler/nutch/conf/nutch-site.xml
060114 150937 Indexer: adding segment: /user/crawler/crawl-20060114111226/segments/20060114122751
060114 150937 Indexer: adding segment: /user/crawler/crawl-20060114111226/segments/20060114133620
Exception in thread "main" java.io.IOException: timed out waiting for response
        at org.apache.nutch.ipc.Client.call(Client.java:296)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy1.submitJob(Unknown Source)
        at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)

1. Any ideas what might have caused it to time out just now, when it had successfully run many jobs up to that point?

2. What cruft might I need to clean up now that it has died? For example, I now see a reference to /home/crawler/tmp/local/jobTracker/job_18cunz.xml when I try to execute some Nutch commands.

3. What's the best way to find out how many pages were actually crawled, how many links are in the DB, etc.? The 0.7-era commands (readdb, segread, etc.) don't seem to work with the new NDFS setup.

4. Any idea whether 4 hours is a reasonable amount of time for this test? It seemed long to me, given that I was starting with a single URL as the seed.

Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
