Ken Krugler wrote:
> Hello fellow Nutchers,
>
> I followed the steps described here by Doug:
>
> <http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/[EMAIL PROTECTED]>
>
> ...to start a test run of the new (0.8, as of 1/12/2006) version of
> Nutch.
>
> It ran for quite a while on my three machines - started at 111226, and
> died at 150937, so almost four hours.
>
> The error occurred during the Indexer phase:
>
> 060114 150937 Indexer: starting
> 060114 150937 Indexer: linkdb: crawl-20060114111226/linkdb
> 060114 150937 Indexer: adding segment: /user/crawler/crawl-20060114111226/segments/20060114111918
> 060114 150937 parsing file:/home/crawler/nutch/conf/nutch-default.xml
> 060114 150937 parsing file:/home/crawler/nutch/conf/crawl-tool.xml
> 060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
> 060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
> 060114 150937 parsing file:/home/crawler/nutch/conf/nutch-site.xml
> 060114 150937 Indexer: adding segment: /user/crawler/crawl-20060114111226/segments/20060114122751
> 060114 150937 Indexer: adding segment: /user/crawler/crawl-20060114111226/segments/20060114133620
> Exception in thread "main" java.io.IOException: timed out waiting for response
>     at org.apache.nutch.ipc.Client.call(Client.java:296)
>     at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
>     at $Proxy1.submitJob(Unknown Source)
>     at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
>     at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
>     at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
>
> 1. Any ideas what might have caused it to time out just now, when it
> had successfully run many jobs up to that point?
>
> 2. What cruft might I need to get rid of because it died? For example,
> I see a reference to /home/crawler/tmp/local/jobTracker/job_18cunz.xml
> now when I try to execute some Nutch commands.

I've had the same problem during the invertlinks step when dealing with a
large number of URLs. Increasing the ipc.client.timeout value from 60000 to
100000 (cf. nutch-default.xml) did the trick.

> 3. What's the best way to find out how many pages were actually
> crawled, how many links are in the DB, etc? The 0.7-era commands
> (readdb, segread, etc) don't seem to be working with the new NDFS setup.

The following gives you some stats about the crawl db (number of URLs
fetched, unfetched, and "dead"):

  nutch readdb crawldb -stats

> 4. Any idea whether 4 hours is a reasonable amount of time for this
> test? It seemed long to me, given that I was starting with a single
> URL as the seed.

How many crawl passes did you do?

--Flo
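
For reference, the timeout change described above could be written as a property override along these lines. This is only a sketch: it assumes the override goes in conf/nutch-site.xml (one of the files listed in the log) rather than being edited directly in nutch-default.xml, and that the value is in milliseconds, which is my understanding of this setting. The enclosing root element is left out because it depends on the Nutch version in use.

```xml
<!-- Sketch of a site-level override; placing it in nutch-site.xml is an assumption. -->
<property>
  <name>ipc.client.timeout</name>
  <!-- Raised from the 60000 default mentioned above; assumed to be milliseconds. -->
  <value>100000</value>
</property>
```

Since the exception was raised on the client side of the job submission call, the machine that launches the crawl is the one that needs to see the new value.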
