I have written this script to crawl with Nutch 0.9. Though, I have
tried to take care that this should work for re-crawls as well, but I
have never done any real world testing for re-crawls. I use this to
crawl.

You may try this out. We can make some changes if this is not found to
be appropriate for re-crawls.

Regards,
Susam Pal
http://susam.in/

#!/bin/sh

# Runs the Nutch bot to crawl or re-crawl
# Usage: bin/runbot [safe]
#        If executed in 'safe' mode, it doesn't delete the temporary
#        directories generated during crawl. This might be helpful for
#        analysis and recovery in case a crawl fails.
#
# Author: Susam Pal

depth=2
threads=50
adddays=5
topN=2 # Comment this statement if you don't want to set topN value

# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo runbot: $0 could not find environment variable NUTCH_HOME
  echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
  echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi

if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo runbot: $0 could not find environment variable NUTCH_HOME
  echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
  echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi

if [ -n "$topN" ]
then
  topN="--topN $rank"
else
  topN=""
fi

steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN
-adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth $depth failed. Deleting it."
    rm -rf $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm -rf crawl/segments/*
else
  mkdir crawl/FETCHEDsegments
  mv --verbose crawl/segments/* crawl/FETCHEDsegments
fi

mv --verbose crawl/MERGEDsegments/* crawl/segments
rmdir crawl/MERGEDsegments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb
crawl/linkdb crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes

if [ "$safe" != "yes" ]
then
  rm -rf crawl/NEWindexes
fi

echo "----- Reloading index on the search site (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
  touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
  echo Done!
else
  echo runbot: Can not reload index in safe mode.
  echo runbot: Please reload it manually using the following command:
  echo runbot: touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
fi

echo "runbot: FINISHED: Crawl completed!"

On 8/9/07, Brian Demers <[EMAIL PROTECTED]> wrote:
> All,
>
> Does anyone have an updated recrawl script for 0.9?
>
> Also, does anyone have a link that describes each phase of a crawl /
> recrawl (for 0.9)
>
> it looks like it changes each version.  I searched the wiki, but i am
> still unclear.
>
> thanks
>

Reply via email to