Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/Crawl

The comment on the change is:
changes in script for Nutch 1.0-dev

------------------------------------------------------------------------------
   * Safe Mode
  
  === Normal Mode ===
- If the script is executed with the command 'bin/runbot', it will delete all 
the directories such as fetched segments, generated indexes, etc, so as to save 
space. It will also reload the index after it finishes crawling and the new 
crawl DB would go live.
+ If the script is executed with the command 'bin/runbot', it will delete 
intermediate directories such as fetched segments and generated indexes, so as 
to save space.
  
- '''Caution:''' This also means that if something has gone wrong during the 
crawl and the resultant crawl DB is corrupt or incomplete, it might not return 
results for any query. And since this crawl DB would go live in 'normal mode', 
your visitors may not see any results.
+ '''Caution:''' This also means that if something goes wrong during the crawl 
and the resultant crawl DB is corrupt or incomplete, no recovery action can be 
taken, because the intermediate directories have already been deleted.
  
  === Safe Mode ===
+ Alternatively, the script can be executed in safe mode as 'bin/runbot safe', 
which prevents deletion of these directories. Instead, all important temporary 
directories are backed up with the prefix BACKUP, e.g. crawl/BACKUPsegments, 
crawl/BACKUPindexes, crawl/BACKUPindex. If errors occur, you can take recovery 
action because the directories haven't been deleted. You can then manually 
merge the segments, generate the indexes, etc. from these directories and 
reload the index.
- Alternatively, the script can be executed in safe mode as 'bin/runbot safe' 
which will prevent deletion of these directories.
- If errors occur, you can take recovery action because the directories haven't 
been deleted. You can then manually merge the segments, generate indexes, etc. 
from the directories and make the resultant crawl DB go live.
- 
- Safe Mode also suppresses the automatic reloading of the new index. 
Therefore, the resultant crawl DB does not go live immediately after crawling. 
This gives you a chance to first test the new crawl DB for valid results. If it 
is found to work, you can make this new DB go live.
  
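The safe-mode backup naming described above can be sketched in a scratch directory (the scratch path is an assumption; only the BACKUP* names come from the script):

```shell
#!/bin/sh
# Sketch of the safe-mode backup layout. A scratch directory stands in
# for a real crawl; only the BACKUP* names are taken from the script.
crawl="$(mktemp -d)/crawl"
mkdir -p "$crawl/segments" "$crawl/NEWindexes" "$crawl/index"

# Instead of 'rm -rf', safe mode keeps a BACKUP copy of each directory:
mv "$crawl/segments"   "$crawl/BACKUPsegments"
mv "$crawl/NEWindexes" "$crawl/BACKUPindexes"
mv "$crawl/index"      "$crawl/BACKUPindex"

ls "$crawl"
```

After a safe-mode run, these BACKUP directories are what you would merge or index from by hand if the crawl went wrong.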
  === Normal Mode vs. Safe Mode ===
  Ideally, you should run the script in safe mode a couple of times to make 
sure the crawl is running fine. Once you are sure that everything will go fine, 
you need not run it in safe mode.
@@ -61, +58 @@

  == Script ==
  {{{
  #!/bin/sh
+ 
+ # Author: Susam Pal
+ #
+ # 'runbot' script to crawl and re-crawl using Nutch 0.9 and Nutch 1.0
+ #
+ # Modify the values of the variables in the beginning to alter the
+ # behaviour of the script. The script accepts a single optional
+ # argument, 'safe', which runs it in safe mode, e.g. bin/runbot safe.
+ # Safe mode prevents deletion of temporary directories so that recovery
+ # action can be taken if anything goes wrong during the crawl.
+ 
- depth=8
+ depth=2
- threads=50
+ threads=5
  adddays=5
- topN=1000 #Comment this statement if you don't want to set topN value
+ topN=15 # Comment this statement out if you don't want to set a topN value
+ 
+ # Arguments for rm and mv
+ RMARGS="-rf"
+ MVARGS="--verbose"
  
  # Parse arguments
  if [ "$1" == "safe" ]
@@ -105, +117 @@

  for((i=0; i < $depth; i++))
  do
    echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
-   $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN -adddays 
$adddays
+   $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
+       -adddays $adddays
    if [ $? -ne 0 ]
    then
      echo "runbot: Stopping at depth $depth. No more URLs to fetch."
@@ -116, +129 @@

    $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
    if [ $? -ne 0 ]
    then
-     echo "runbot: fetch $segment at depth `expr $i + 1` failed. Deleting 
segment $segment."
+     echo "runbot: fetch $segment at depth `expr $i + 1` failed."
+     echo "runbot: Deleting segment $segment."
-     rm -rf $segment
+     rm $RMARGS $segment
      continue
    fi
  
-   #$NUTCH_HOME/bin/nutch parse $segment
    $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
  done
  
@@ -129, +142 @@

  $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
  if [ "$safe" != "yes" ]
  then
-   rm -rf crawl/segments
+   rm $RMARGS crawl/segments
  else
+   rm $RMARGS crawl/BACKUPsegments
-   mv $MVARGS crawl/segments crawl/FETCHEDsegments
+   mv $MVARGS crawl/segments crawl/BACKUPsegments
  fi
  
  mv $MVARGS crawl/MERGEDsegments crawl/segments
@@ -140, +154 @@

  $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*
  
  echo "----- Index (Step 5 of $steps) -----"
- $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb 
crawl/segments/*
+ $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
+     crawl/segments/*
  
  echo "----- Dedup (Step 6 of $steps) -----"
  $NUTCH_HOME/bin/nutch dedup crawl/NEWindexes
  
  echo "----- Merge Indexes (Step 7 of $steps) -----"
- $NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes
+ $NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes
+ 
+ echo "----- Loading New Index (Step 8 of $steps) -----"
+ ${CATALINA_HOME}/bin/shutdown.sh
  
  if [ "$safe" != "yes" ]
  then
-   rm -rf crawl/NEWindexes
+   rm $RMARGS crawl/NEWindexes
+   rm $RMARGS crawl/index
+ else
+   rm $RMARGS crawl/BACKUPindexes
+   rm $RMARGS crawl/BACKUPindex
+   mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
+   mv $MVARGS crawl/index crawl/BACKUPindex
  fi
  
+ mv $MVARGS crawl/NEWindex crawl/index
+ 
+ ${CATALINA_HOME}/bin/startup.sh
- echo "----- Reloading index on the search site (Step 8 of $steps) -----"
- if [ "$safe" != "yes" ]
- then
-   touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
-   echo Done!
- else
-   echo runbot: Can not reload index in safe mode.
-   echo runbot: Please reload it manually using the following command:
-   echo runbot: touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
- fi
  
  echo "runbot: FINISHED: Crawl completed!"
+ echo ""
  }}}
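If a safe-mode run does go wrong, recovery amounts to moving the BACKUP copies back to the names the script expects and re-running it. A minimal sketch, using a scratch directory in place of a real crawl (the Tomcat restart is taken from the script above and left commented out):

```shell
#!/bin/sh
# Hypothetical recovery after a failed safe-mode crawl: restore the
# backed-up directories to the names the script expects. A scratch
# directory stands in for a real crawl.
cd "$(mktemp -d)"
mkdir -p crawl/BACKUPsegments crawl/BACKUPindex

mv crawl/BACKUPsegments crawl/segments  # fetched segments back in place
mv crawl/BACKUPindex    crawl/index     # last known-good index back in place

# Then reload the restored index in the search webapp, as the script does:
#   ${CATALINA_HOME}/bin/shutdown.sh
#   ${CATALINA_HOME}/bin/startup.sh
echo "runbot: recovery staged"
```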
  
