You are trying to refetch an already fetched segment.
This can happen if, in your loop,

bin/nutch generate crawl/crawldb crawl/segments -topN 1000

does not generate a new segment: ls -d crawl/segments/2* | tail -1 then
still picks up the previous segment, which has already been fetched.
You have to check whether this command actually generated a new segment,
e.g. by checking its exit status. There are scripts on the Nutch wiki
that do this.
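
For example, here is a minimal sketch of such a loop. Note that the
exit-status convention is an assumption on my part: many Nutch releases
make generate return a non-zero status when 0 records are selected, but
verify this against your version or against the wiki scripts. The
segment-name comparison is a fallback that works even if your generate
always exits 0.

#!/bin/sh
# Sketch: only fetch when generate actually produced a new segment.
LOOP=100
last=""
for i in `seq 1 $LOOP`; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  if [ $? -ne 0 ]; then
    # assumption: generate exits non-zero when nothing was selected
    echo "generate produced no new segment, stopping"
    break
  fi
  segment=`ls -d crawl/segments/2* | tail -1`
  # fallback that does not rely on the exit status: stop if the newest
  # segment is the one we already fetched in the previous iteration
  if [ "$segment" = "$last" ]; then
    echo "no new segment, stopping"
    break
  fi
  last=$segment
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
done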

ksee wrote:
> I don't really understand why running nutch in that cycle
> (generate/fetch/update) should affect my search results versus using
> "nutch crawl". I read in the docs that nutch crawl should suffice for any
> site with less than 1 million pages. Our wiki has at most a few hundred.
>
> But I did try to create a script and run it. I'm still not having much
> success though.
> Here is my script:
> {
> #!/bin/sh
>
> JAVA_HOME=/cygdrive/d/Java/jdk1.6.0_18/
>
> LOOP=100
>
> bin/nutch inject crawl/crawldb urls
>
> #### Fetching
> bin/nutch generate crawl/crawldb crawl/segments
>
> s1=`ls -d crawl/segments/2* | tail -1`
> echo $s1
>
> bin/nutch fetch $s1
>
> bin/nutch updatedb crawl/crawldb $s1
>
> for i in `seq 1 $LOOP`;
> do
> bin/nutch generate crawl/crawldb crawl/segments -topN 1000
>
> s2=`ls -d crawl/segments/2* | tail -1`
> echo $s2
>
> bin/nutch fetch $s2
>
> bin/nutch updatedb crawl/crawldb $s2
> done
>
>
> #### Indexing
> bin/nutch invertlinks crawl/linkdb -dir crawl/segments
>
> bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
> }
>
> Unfortunately, I think I'm doing something wrong. In my Fetching loop, I'm
> getting some errors:
> Fetcher: segment: crawl/segments/20100325152823
> Exception in thread "main" java.io.IOException: Segment already fetched!
>         at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:50)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:793)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:969)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1003)
> CrawlDb update: starting
>
>
> Can anyone explain to me why the nutch cycle (generate/fetch/update) is more
> appropriate than the nutch crawl for my company's small wiki site?
> Also, can someone fix my bash script or point me to a standard one?
>
> Thanks,
> Kane
>
> Chris Laif wrote:
>   
>> On Wed, Mar 17, 2010 at 11:36 PM, ksee <k...@fetch.com> wrote:
>>     
>>> Does anyone have any suggestions at all?
>>> I'm still desperately searching for a solution. I'm open to even obvious
>>> suggestions/checks that I may have overlooked.
>>>       
>> Please have a look at one of the standard Nutch crawl bash scripts you
>> can find on the web. You have to run Nutch multiple times in a row
>> (the generate/fetch/update cycle).
>>
>> Chris
