I don't really understand why running nutch in that cycle (generate/fetch/update) should affect my search versus using "nutch crawl". I read in the docs that "nutch crawl" should suffice for any site with fewer than 1 million pages. Our wiki has at most a few hundred.
But I did try to create a script and run it. I'm still not having much success, though. Here is my script:

#!/bin/sh
JAVA_HOME=/cygdrive/d/Java/jdk1.6.0_18/
LOOP=100

bin/nutch inject crawl/crawldb urls

#### Fetching
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch updatedb crawl/crawldb $s1

for i in `seq 1 $LOOP`; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s2=`ls -d crawl/segments/2* | tail -1`
  echo $s2
  bin/nutch fetch $s2
  bin/nutch updatedb crawl/crawldb $s2
done

#### Indexing
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

Unfortunately, I think I'm doing something wrong. In my fetching loop, I'm getting errors like this:

Fetcher: segment: crawl/segments/20100325152823
Exception in thread "main" java.io.IOException: Segment already fetched!
        at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:50)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:793)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:969)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1003)
CrawlDb update: starting

Can anyone explain to me why the nutch cycle (generate/fetch/update) is more appropriate than "nutch crawl" for my company's small wiki site? Also, can someone fix my bash script, or point me to a standard one?

Thanks,
Kane


Chris Laif wrote:
>
> On Wed, Mar 17, 2010 at 11:36 PM, ksee <k...@fetch.com> wrote:
>>
>> Does anyone have any suggestions at all?
>> I'm still desperately searching for a solution. I'm open to even obvious
>> suggestions/checks that I may have overlooked.
>
> Please have a look at one of the standard nutch crawl (bash-)scripts
> you can find on the web. You have to start nutch multiple times in a
> row (generate/fetch/update cycle).
>
> Chris
>

--
View this message in context: http://old.nabble.com/problem-crawling-entire-internal-website-tp27908943p28035952.html
Sent from the Nutch - User mailing list archive at Nabble.com.
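[Editor's note] The "Segment already fetched!" error most likely appears because, once all URLs have been fetched, "generate" creates no new segment, so `ls -d crawl/segments/2* | tail -1` picks up the previous segment again. The standard crawl scripts Chris mentions guard against this by stopping when generate produces nothing. A minimal sketch of that loop is below; the NUTCH and LOOP variables are illustrative knobs, and it assumes "bin/nutch generate" exits with a non-zero status when no URLs are due for fetching (true of later Nutch 1.x releases; older ones may require checking the log output instead).

```shell
#!/bin/sh
# Sketch of the standard generate/fetch/update loop with a stop condition.
# NUTCH and LOOP are hypothetical knobs added for illustration.
: "${NUTCH:=bin/nutch}"
: "${LOOP:=100}"

crawl_cycle() {
  for i in `seq 1 $LOOP`; do
    # If generate selects no URLs, no new segment is created; stop here
    # instead of re-fetching the last segment ("Segment already fetched!").
    if ! "$NUTCH" generate crawl/crawldb crawl/segments -topN 1000; then
      echo "generate selected no URLs - stopping"
      break
    fi
    segment=`ls -d crawl/segments/2* | tail -1`
    echo "fetching $segment"
    "$NUTCH" fetch "$segment"
    "$NUTCH" updatedb crawl/crawldb "$segment"
  done
}
```

With this guard the loop ends as soon as the crawldb has no more unfetched URLs, which for a wiki of a few hundred pages should happen after only a few rounds rather than all 100.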