You are trying to refetch an already fetched segment. This can happen if, in your loop,

    bin/nutch generate crawl/crawldb crawl/segments -topN 1000

does not generate a new segment: your `ls -d crawl/segments/2* | tail -1` then picks up the previous segment again, and the fetcher refuses to refetch it. You have to check whether this command actually generated a new segment; check its exit status. There are scripts in the Nutch wiki that do this.

ksee wrote:
> I don't really understand why running nutch in that cycle
> (generate/fetch/update) should affect my search versus using "nutch crawl".
> I read in the docs that "nutch crawl" should suffice for any site with less
> than 1 million pages. Our wiki has at most a few hundred.
>
> But I did try to create a script and run it. I'm still not having much
> success, though. Here is my script:
>
> #!/bin/sh
>
> JAVA_HOME=/cygdrive/d/Java/jdk1.6.0_18/
>
> LOOP=100
>
> bin/nutch inject crawl/crawldb urls
>
> #### Fetching
> bin/nutch generate crawl/crawldb crawl/segments
>
> s1=`ls -d crawl/segments/2* | tail -1`
> echo $s1
>
> bin/nutch fetch $s1
> bin/nutch updatedb crawl/crawldb $s1
>
> for i in `seq 1 $LOOP`;
> do
>   bin/nutch generate crawl/crawldb crawl/segments -topN 1000
>
>   s2=`ls -d crawl/segments/2* | tail -1`
>   echo $s2
>
>   bin/nutch fetch $s2
>   bin/nutch updatedb crawl/crawldb $s2
> done
>
> #### Indexing
> bin/nutch invertlinks crawl/linkdb -dir crawl/segments
> bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
>
> Unfortunately, I think I'm doing something wrong. In my fetching loop, I'm
> getting some errors:
>
> Fetcher: segment: crawl/segments/20100325152823
> Exception in thread "main" java.io.IOException: Segment already fetched!
>         at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:50)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:793)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:969)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1003)
> CrawlDb update: starting
>
> Can anyone explain to me why the nutch cycle (generate/fetch/update) is
> more appropriate than "nutch crawl" for my company's small wiki site?
> Also, can someone fix my bash script or point me to a standard one?
>
> Thanks,
> Kane
>
> Chris Laif wrote:
>> On Wed, Mar 17, 2010 at 11:36 PM, ksee <k...@fetch.com> wrote:
>>> Does anyone have any suggestions at all?
>>> I'm still desperately searching for a solution. I'm open to even obvious
>>> suggestions/checks that I may have overlooked.
>>
>> Please have a look at one of the standard nutch crawl (bash-)scripts
>> you can find on the web. You have to start nutch multiple times in a
>> row (generate/fetch/update cycle).
>>
>> Chris
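To make the "check whether a new segment was generated" advice concrete, here is a minimal sketch of a guarded loop step. The NUTCH and CRAWL variables and the function name are my own (not from the original script); the guard compares the newest segment directory before and after generate, in addition to checking the exit status, which is the same idea as the wiki scripts:

```shell
#!/bin/sh
# Sketch of one guarded generate/fetch/update step.
# NUTCH and CRAWL are assumed variable names, not from the original post.
NUTCH=${NUTCH:-bin/nutch}
CRAWL=${CRAWL:-crawl}

generate_and_fetch() {
    # Remember the newest segment before generating.
    before=`ls -d $CRAWL/segments/2* 2>/dev/null | tail -1`

    # If generate exits non-zero, there is nothing left to fetch.
    $NUTCH generate $CRAWL/crawldb $CRAWL/segments -topN 1000 || return 1

    # If no new segment directory appeared, bail out instead of
    # refetching the previous one ("Segment already fetched!").
    after=`ls -d $CRAWL/segments/2* 2>/dev/null | tail -1`
    if [ -z "$after" ] || [ "$after" = "$before" ]; then
        echo "no new segment generated, stopping" >&2
        return 1
    fi

    $NUTCH fetch $after && $NUTCH updatedb $CRAWL/crawldb $after
}
```

In the loop you would then call `generate_and_fetch || break`, so the cycle stops as soon as generate stops producing new segments instead of hitting the IOException.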