Dear wiki user,

You have subscribed to a wiki page "Nutch Wiki" for change notification.
The page Crawl has been reverted to revision 10 by cmd.
http://wiki.apache.org/nutch/Crawl?action=diff&rev1=11&rev2=12

--------------------------------------------------

  The complete job of this script has been divided broadly into 8 steps.
   1. Inject URLs
-  1. Generate, Fetch, Parse, Update Loop
+  2. Generate, Fetch, Parse, Update Loop
-  1. Merge Segments
+  3. Merge Segments
-  1. Invert Links
+  4. Invert Links
-  1. Index
+  5. Index
-  1. Dedup
+  6. Dedup
-  1. Merge Indexes
+  7. Merge Indexes
-  1. Load new indexes
+  8. Load new indexes
  == Modes of Execution ==
  The script can be executed in two modes:-
-
   * Normal Mode
   * Safe Mode
@@ -43, +42 @@
  then
  NUTCH_HOME=.
  }}}
+ Set 'NUTCH_HOME' to the path of the Nutch directory (if you are not setting it as an environment variable, since if environment variable is set, the above assignment is ignored).
  === CATALINA_HOME ===
@@ -53, +53 @@
  then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  }}}
+ Similar to the previous section, if this variable is set in the environment, then the above assignment is ignored.
  == Can it re-crawl? ==
  The author has used this script to re-crawl a couple of times. However, no real world testing has been done for re-crawling. Therefore, you may try to use the script for re-crawl. If it works fine or it doesn't work properly for re-crawl, please let us know.
  == Script ==
+ {{{
- {{{#!/bin/sh
+ #!/bin/sh
  # runbot script to run the Nutch bot for crawling and re-crawling.
  # Usage: bin/runbot [safe]
@@ -88, +90 @@
  then
  NUTCH_HOME=.
  echo runbot: $0 could not find environment variable NUTCH_HOME
- echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
+ echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
  else
- echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
+ echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
  fi
  if [ -z "$CATALINA_HOME" ]
  then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo runbot: $0 could not find environment variable NUTCH_HOME
- echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
+ echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
  else
- echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
+ echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
  fi
  if [ -n "$topN" ]

