I have written this script to crawl with Nutch 0.9. I have tried to take care that it works for re-crawls as well, but I have never done any real-world testing of re-crawls. This is the script I use to crawl.
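For anyone trying it, a minimal setup and invocation could look like this, assuming the script below is saved as bin/runbot inside the Nutch install directory. The seed URL and file name are only examples, and conf/crawl-urlfilter.txt must of course allow the hosts you want to crawl:

  # one-time setup (example values)
  mkdir urls
  echo 'http://www.example.com/' > urls/seed.txt  # seed list read by 'nutch inject' in Step 1

  bin/runbot        # crawl, or re-crawl over an existing crawl/ directory
  bin/runbot safe   # same, but keep temporary directories for analysis/recovery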
You may try it out. We can make changes if it turns out not to be appropriate for re-crawls.

Regards,
Susam Pal
http://susam.in/

#!/bin/bash
# Runs the Nutch bot to crawl or re-crawl
# Usage: bin/runbot [safe]
#        If executed in 'safe' mode, it doesn't delete the temporary
#        directories generated during crawl. This might be helpful for
#        analysis and recovery in case a crawl fails.
#
# Author: Susam Pal

depth=2
threads=50
adddays=5  # advance the generator's clock so pages falling due within
           # the next 5 days are re-fetched in this run
topN=2     # Comment this statement if you don't want to set topN value

# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo runbot: $0 could not find environment variable NUTCH_HOME
  echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
  echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi

if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo runbot: $0 could not find environment variable CATALINA_HOME
  echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
  echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=8

echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
      -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi

  segment=`ls -d crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth $depth failed. Deleting it."
    rm -rf $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm -rf crawl/segments/*
else
  mkdir crawl/FETCHEDsegments
  mv --verbose crawl/segments/* crawl/FETCHEDsegments
fi
mv --verbose crawl/MERGEDsegments/* crawl/segments
rmdir crawl/MERGEDsegments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
    crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes
if [ "$safe" != "yes" ]
then
  rm -rf crawl/NEWindexes
fi

echo "----- Reloading index on the search site (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
  touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
  echo Done!
else
  echo runbot: Cannot reload index in safe mode.
  echo runbot: Please reload it manually using the following command:
  echo runbot: touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
fi

echo "runbot: FINISHED: Crawl completed!"

On 8/9/07, Brian Demers <[EMAIL PROTECTED]> wrote:
> All,
>
> Does anyone have an updated recrawl script for 0.9?
>
> Also, does anyone have a link that describes each phase of a crawl /
> recrawl (for 0.9)
>
> it looks like it changes each version. I searched the wiki, but i am
> still unclear.
>
> thanks
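P.S. If you want periodic re-crawls to happen unattended, a cron entry along these lines should work (the install path and log file are just examples). The adddays=5 setting makes generate treat pages as due for re-fetch up to 5 days early, so a scheduled re-crawl picks them up before db.default.fetch.interval (30 days by default) expires:

  # m  h  dom mon dow  command
  0    3  *   *   *    cd /opt/nutch-0.9 && bin/runbot >> logs/runbot.log 2>&1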
