Hi,

I was wondering if anyone has a simple script that uses Nutch 1.0 to
crawl an intranet site spread across multiple web servers.  I can run

    /webroot/oscrawlers/nutch/bin/nutch crawl \
        /webroot/oscrawlers/nutch/urls/seed.txt \
        -dir /webroot/oscrawlers/nutch/crawl -depth 8 -topN 1000

and get a big chunk of the files.  I then tried to follow the
"Whole-web" crawling steps outlined in the Nutch Tutorial,
http://wiki.apache.org/nutch/NutchTutorial, but nothing new seems to
get into the index.  It seems to be crawling the same URLs each time.
When I run the "-stats" command against the database, I get the same
stats output after every run.
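
A minimal sketch of that stats comparison, assuming the stats come from
Nutch's "readdb <crawldb> -stats" command and are printed to standard
output.  The helper names crawldb_stats and stats_unchanged are made up
for illustration, and the default paths are the ones used in this post.

```shell
#!/bin/sh
# Paths default to the ones used in this post; override them if needed.
NUTCHBINARY=${NUTCHBINARY:-/webroot/oscrawlers/nutch/bin/nutch}
NUTCHDB=${NUTCHDB:-/webroot/oscrawlers/nutch/crawl/crawldb}

# Hypothetical helper: print the CrawlDb statistics via "readdb -stats".
crawldb_stats() {
    "$NUTCHBINARY" readdb "$NUTCHDB" -stats
}

# Hypothetical helper: exit 0 if two saved stats files are byte-identical.
stats_unchanged() {
    cmp -s "$1" "$2"
}
```

One way to use it: save "crawldb_stats > before.txt", run a crawl round,
save "crawldb_stats > after.txt", then call
"stats_unchanged before.txt after.txt".  An exit status of 0 would mean
the crawl database did not change between the two snapshots.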

Here is my script
#!/bin/sh
####################################################
# nutch_crawler.sh
####################################################
echo "  Set UMASK ...";
umask 002;
echo ""

# Set Variables
LIMIT=1 # Loop runs while A <= LIMIT, so LIMIT+1 passes (2 here)
A=0
NUTCHBINARY='/webroot/oscrawlers/nutch/bin/nutch'
NUTCHDB='/webroot/oscrawlers/nutch/crawl/crawldb'
NUTCHSEGMENTS='/webroot/oscrawlers/nutch/crawl/segments'
NUTCHINDEXES='/webroot/oscrawlers/nutch/crawl/indexes'
NUTCHLINKDB='/webroot/oscrawlers/nutch/crawl/linkdb'

# Inject starting URLs into the database
#echo "  Injecting Starting URLs ..."
#echo ""
#$NUTCHBINARY inject $NUTCHDB /webroot/oscrawlers/nutch/urls/seed.txt
#sleep 30

while [ $A -le "$LIMIT" ]
do
        # Generate a fetch list
        echo "  Generating fetch list ..."
        $NUTCHBINARY generate $NUTCHDB $NUTCHSEGMENTS -topN 1000
        
        # Find the newest created segment
        echo ""
        echo "  Get segment ..."
        s1=$(ls -d "$NUTCHSEGMENTS"/2* | tail -1)
        echo ""
        echo "  Segment is: $s1 ..."
        
        # Fetch this segment
        $NUTCHBINARY fetch $s1
        
        # Add one to A and continue looping until LIMIT is reached
        A=$(($A+1))
        sleep 60
done

# Invert links
echo ""
echo "  Building inverted links ... "
$NUTCHBINARY invertlinks $NUTCHLINKDB -dir $NUTCHSEGMENTS

# Before I can do this, I need to delete the current indexes.  Doesn't
# seem to affect the current searches
echo ""
echo "  Remove old indexes ..."
rm -rf $NUTCHINDEXES

# Index Segments
echo ""
echo "  Build new indexes ..."
$NUTCHBINARY index $NUTCHINDEXES $NUTCHDB $NUTCHLINKDB $NUTCHSEGMENTS/*
echo ""
echo "  Done ...";
###########################################################
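
For comparison, the whole-web steps in the tutorial run "updatedb" on
each segment after it is fetched, which the loop above does not do; that
may be why the same URLs keep being generated.  Below is a minimal
sketch of one round with that step added.  It assumes the fetcher also
parses the pages (so the segment contains the data "updatedb" reads),
and the function name crawl_round is made up for illustration.

```shell
#!/bin/sh
# Paths default to the ones used in the script above.
NUTCHBINARY=${NUTCHBINARY:-/webroot/oscrawlers/nutch/bin/nutch}
NUTCHDB=${NUTCHDB:-/webroot/oscrawlers/nutch/crawl/crawldb}
NUTCHSEGMENTS=${NUTCHSEGMENTS:-/webroot/oscrawlers/nutch/crawl/segments}

# Hypothetical helper: one generate/fetch/updatedb round.
# $1 is the -topN value passed to "generate".
crawl_round() {
    topn=$1

    # Generate a fetch list in a new segment
    "$NUTCHBINARY" generate "$NUTCHDB" "$NUTCHSEGMENTS" -topN "$topn" || return 1

    # Segment names are timestamps, so the last one listed is the newest
    segment=$(ls -d "$NUTCHSEGMENTS"/2* | tail -1)

    # Fetch the segment
    "$NUTCHBINARY" fetch "$segment" || return 1

    # Feed the fetch results back into the crawl database so the next
    # "generate" can pick up newly discovered URLs
    "$NUTCHBINARY" updatedb "$NUTCHDB" "$segment"
}
```

Calling "crawl_round 1000" inside the while loop would replace the
generate and fetch commands there.
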
Jake Jacobson

http://www.linkedin.com/in/jakejacobson
http://www.new.facebook.com/people/Jake_Jacobson/622727274

Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
   -- ANONYMOUS