Hi, I was wondering if anyone has a simple script that uses Nutch 1.0 to crawl an intranet site spread across multiple web servers. I can run

/webroot/oscrawlers/nutch/bin/nutch crawl /webroot/oscrawlers/nutch/urls/seed.txt -dir /webroot/oscrawlers/nutch/crawl -depth 8 -topN 1000

and get a big chunk of the files. I then tried to follow the "Whole-web" crawling steps outlined in the Nutch Tutorial, http://wiki.apache.org/nutch/NutchTutorial, but nothing new seems to get into the index. It seems to be crawling the same URLs each time, and when I run the "-stats" command against the database I get the same stats output as before.
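For comparison, my reading of the tutorial's whole-web section is that each round is generate, fetch, and then updatedb, which writes the fetch results back into the crawldb so that the next generate can pick different URLs. A rough sketch of one such round is below. The function names newest_segment and crawl_round and the NUTCH and CRAWL variables are placeholders of my own, the paths are the ones from my setup, and I have not verified this against a real crawl.

```shell
#!/bin/sh
# Sketch only: one generate/fetch/updatedb round against an existing crawldb.
# NUTCH and CRAWL are assumed locations; override them in the environment.
NUTCH=${NUTCH:-/webroot/oscrawlers/nutch/bin/nutch}
CRAWL=${CRAWL:-/webroot/oscrawlers/nutch/crawl}

# Print the newest segment directory under the directory given as $1.
# Segment names are timestamps, so lexical order is chronological.
newest_segment() {
    ls -d "$1"/2* 2>/dev/null | sort | tail -1
}

# Run one round; stop early if any step fails.
crawl_round() {
    "$NUTCH" generate "$CRAWL/crawldb" "$CRAWL/segments" -topN 1000 || return 1
    seg=$(newest_segment "$CRAWL/segments")
    [ -n "$seg" ] || return 1
    "$NUTCH" fetch "$seg" || return 1
    # Feed the fetch results back into the crawldb before the next round.
    "$NUTCH" updatedb "$CRAWL/crawldb" "$seg"
}
```

Calling crawl_round once per loop iteration, followed by the usual invertlinks and index steps, would be one way to use it.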
Here is my script:

####################################################
#!/bin/sh
####################################################
# nutch_crawler.sh
####################################################
echo " Set UMASK ...";
umask 002;
echo ""

# Set Variables
LIMIT=1 # Max loops to execute
A=0
NUTCHBINARY='/webroot/oscrawlers/nutch/bin/nutch'
NUTCHDB='/webroot/oscrawlers/nutch/crawl/crawldb'
NUTCHSEGMENTS='/webroot/oscrawlers/nutch/crawl/segments'
NUTCHINDEXES='/webroot/oscrawlers/nutch/crawl/indexes'
NUTCHLINKDB='/webroot/oscrawlers/nutch/crawl/linkdb'

# Inject starting URLs into the database
#echo " Injecting Starting URLs ..."
#echo ""
#$NUTCHBINARY inject $NUTCHDB /webroot/oscrawlers/nutch/urls/seed.txt
#sleep 30

while [ $A -le "$LIMIT" ]
do
    # Generate a fetch list
    echo " Generating fetch list ..."
    $NUTCHBINARY generate $NUTCHDB $NUTCHSEGMENTS -topN 1000

    # Find the newest created segment
    echo ""
    echo " Get segment ..."
    s1=`ls -d /webroot/oscrawlers/nutch/crawl/segments/2* | tail -1`
    echo ""
    echo " Segment is: $s1 ..."

    # Fetch this segment
    $NUTCHBINARY fetch $s1

    # Add one to A and continue looping until LIMIT is reached
    A=$(($A+1))
    sleep 60
done

# Invert links
echo ""
echo " Building inverted links ... "
$NUTCHBINARY invertlinks $NUTCHLINKDB -dir $NUTCHSEGMENTS

# Before I can do this, I need to delete the current indexes. Doesn't seem to affect the current searches
echo ""
echo " Remove old indexes ..."
rm -rf $NUTCHINDEXES

# Index Segments
echo ""
echo " Build new indexes ..."
$NUTCHBINARY index $NUTCHINDEXES $NUTCHDB $NUTCHLINKDB $NUTCHSEGMENTS/*
echo ""
echo " Done ...";
###########################################################

Jake Jacobson
http://www.linkedin.com/in/jakejacobson
http://www.new.facebook.com/people/Jake_Jacobson/622727274

Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter.
-- ANONYMOUS