Hello fellow Nutchers,
I followed the steps described here by Doug:
<http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/[EMAIL PROTECTED]>
...to start a test run of the new (0.8, as of 1/12/2006) version of Nutch.
It ran for quite a while on my three machines: it started at 111226
(11:12:26) and died at 150937 (15:09:37), so almost four hours.
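For anyone who wants to check the arithmetic, here is a quick sketch in Python. It assumes both stamps are HHMMSS values from the same day (060114 in the log below); the helper name `elapsed` is mine, not anything from Nutch.

```python
from datetime import datetime


def elapsed(start: str, end: str) -> str:
    """Return the wall-clock time between two same-day HHMMSS stamps."""
    fmt = "%H%M%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    # Split the total number of seconds into hours, minutes, and seconds.
    hours, rem = divmod(int(delta.total_seconds()), 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"


# Crawl start and the moment the Indexer died, as quoted above.
print(elapsed("111226", "150937"))  # prints "3h 57m 11s"
```

So the run lasted just under four hours before the Indexer failed.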
The error occurred during the Indexer phase:
060114 150937 Indexer: starting
060114 150937 Indexer: linkdb: crawl-20060114111226/linkdb
060114 150937 Indexer: adding segment: /user/crawler/crawl-20060114111226/segments/20060114111918
060114 150937 parsing file:/home/crawler/nutch/conf/nutch-default.xml
060114 150937 parsing file:/home/crawler/nutch/conf/crawl-tool.xml
060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
060114 150937 parsing file:/home/crawler/nutch/conf/nutch-site.xml
060114 150937 Indexer: adding segment: /user/crawler/crawl-20060114111226/segments/20060114122751
060114 150937 Indexer: adding segment: /user/crawler/crawl-20060114111226/segments/20060114133620
Exception in thread "main" java.io.IOException: timed out waiting for response
        at org.apache.nutch.ipc.Client.call(Client.java:296)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy1.submitJob(Unknown Source)
        at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
1. Any ideas what might have caused it to time out just now, when it
had successfully run many jobs up to that point?
2. What cruft might I need to get rid of because it died? For
example, I see a reference to
/home/crawler/tmp/local/jobTracker/job_18cunz.xml now when I try to
execute some Nutch commands.
3. What's the best way to find out how many pages were actually
crawled, how many links are in the DB, etc.? The 0.7-era commands
(readdb, segread, etc.) don't seem to work with the new NDFS
setup.
4. Any idea whether 4 hours is a reasonable amount of time for this
test? It seemed long to me, given that I was starting with a single
URL as the seed.
Thanks,
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200