I don't really understand why running nutch in that cycle (generate/fetch/update) should affect my search versus using "nutch crawl". I read in the docs that "nutch crawl" should suffice for any site with fewer than 1 million pages. Our wiki has at most a few hundred.
But I did try to create a script and run it. I'm still not having much success, though. Here is my script:

#!/bin/sh
JAVA_HOME=/cygdrive/d/Java/jdk1.6.0_18/
LOOP=100

bin/nutch inject crawl/crawldb urls

#### Fetching
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch updatedb crawl/crawldb $s1

for i in `seq 1 $LOOP`; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s2=`ls -d crawl/segments/2* | tail -1`
  echo $s2
  bin/nutch fetch $s2
  bin/nutch updatedb crawl/crawldb $s2
done

#### Indexing
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

Unfortunately, I think I'm doing something wrong. In my fetching loop, I'm getting errors like this:

Fetcher: segment: crawl/segments/20100325152823
Exception in thread "main" java.io.IOException: Segment already fetched!
        at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:50)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:793)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:969)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1003)
CrawlDb update: starting

Can anyone explain to me why the nutch cycle (generate/fetch/update) is more appropriate than "nutch crawl" for my company's small wiki site? Also, can someone fix my bash script, or point me to a standard one?

Thanks,
Kane


Chris Laif wrote:
>
> On Wed, Mar 17, 2010 at 11:36 PM, ksee <k...@fetch.com> wrote:
>>
>> Does anyone have any suggestions at all?
>> I'm still desperately searching for a solution. I'm open to even obvious
>> suggestions/checks that I may have overlooked.
>
> Please have a look at one of the standard nutch crawl (bash-)scripts
> you can find on the web. You have to start nutch multiple times in a
> row (generate/fetch/update cycle).
>
> Chris
>

--
View this message in context: http://old.nabble.com/problem-crawling-entire-internal-website-tp27908943p28035952.html
Sent from the Nutch - User mailing list archive at Nabble.com.
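[Editor's note] The "Segment already fetched!" error most likely appears because, once all URLs have been fetched, "generate" creates no new segment, so `ls -d crawl/segments/2* | tail -1` picks up the previous segment again. The standard crawl scripts Chris mentions guard against this by stopping when generate produces nothing. A minimal sketch of that loop is below; the NUTCH and LOOP variables are illustrative knobs, and it assumes "bin/nutch generate" exits with a non-zero status when no URLs are due for fetching (true of later Nutch 1.x releases; older ones may require checking the log output instead).

```shell
#!/bin/sh
# Sketch of the standard generate/fetch/update loop with a stop condition.
# NUTCH and LOOP are hypothetical knobs added for illustration.
: "${NUTCH:=bin/nutch}"
: "${LOOP:=100}"

crawl_cycle() {
  for i in `seq 1 $LOOP`; do
    # If generate selects no URLs, no new segment is created; stop here
    # instead of re-fetching the last segment ("Segment already fetched!").
    if ! "$NUTCH" generate crawl/crawldb crawl/segments -topN 1000; then
      echo "generate selected no URLs - stopping"
      break
    fi
    segment=`ls -d crawl/segments/2* | tail -1`
    echo "fetching $segment"
    "$NUTCH" fetch "$segment"
    "$NUTCH" updatedb crawl/crawldb "$segment"
  done
}
```

With this guard the loop ends as soon as the crawldb has no more unfetched URLs, which for a wiki of a few hundred pages should happen after only a few rounds rather than all 100.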