Error at end of MapReduce run with indexing

Ken Krugler Sat, 14 Jan 2006 17:03:31 -0800

Hello fellow Nutchers,

I followed the steps described here by Doug:
 <http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/[EMAIL PROTECTED]>


...to start a test run of the new (0.8, as of 1/12/2006) version of Nutch.

It ran for quite a while on my three machines - it started at 111226 and died at 150937, so almost four hours.
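
As a quick check of that "almost four hours" figure, here is a minimal Python sketch. It assumes both values are HHMMSS timestamps from the same day (as the log prefix 060114 suggests); the helper name elapsed is purely illustrative and not part of Nutch.

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str, fmt: str = "%H%M%S") -> timedelta:
    """Return end - start for two same-day HHMMSS timestamps (assumed format)."""
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# Start and end timestamps quoted in the message above.
run_time = elapsed("111226", "150937")
print(run_time)  # 3:57:11
```

So the run lasted about 3 hours 57 minutes before the Indexer job submission timed out.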


The error occurred during the Indexer phase:

060114 150937 Indexer: starting
060114 150937 Indexer: linkdb: crawl-20060114111226/linkdb

060114 150937 Indexer: adding segment:/user/crawler/crawl-20060114111226/segments/20060114111918

060114 150937 parsing file:/home/crawler/nutch/conf/nutch-default.xml
060114 150937 parsing file:/home/crawler/nutch/conf/crawl-tool.xml
060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
060114 150937 parsing file:/home/crawler/nutch/conf/nutch-site.xml

060114 150937 Indexer: adding segment:/user/crawler/crawl-20060114111226/segments/20060114122751
060114 150937 Indexer: adding segment:/user/crawler/crawl-20060114111226/segments/20060114133620

Exception in thread "main" java.io.IOException: timed out waiting for response
        at org.apache.nutch.ipc.Client.call(Client.java:296)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy1.submitJob(Unknown Source)
        at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)

1. Any ideas what might have caused it to time out just now, when it had successfully run many jobs up to that point?

2. What cruft might I need to get rid of because it died? For example, I see a reference to /home/crawler/tmp/local/jobTracker/job_18cunz.xml now when I try to execute some Nutch commands.

3. What's the best way to find out how many pages were actually crawled, how many links are in the DB, etc? The 0.7-era commands (readdb, segread, etc) don't seem to be working with the new NDFS setup.

4. Any idea whether 4 hours is a reasonable amount of time for this test? It seemed long to me, given that I was starting with a single URL as the seed.


Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
