Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "Tutorial on incremental crawling" page has been changed by Gabriele Kahlout.
http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling?action=diff&rev1=1&rev2=2

--------------------------------------------------

  If not ready, follow [[Tutorial]] to set up and configure Nutch on your machine.
- It also works with Solr. If you have Solr setup
+ Follow the 2 scripts:
- {{{
+ 1. Abridged script using Solr;
+
+ 2. Unabridged script with explanations and using nutch index.
+
+ == 1. Abridged script using Solr ==
+ == 2. Unabridged script with explanations and using nutch index ==
- #!/bin/sh
+ {{{#!/bin/sh
  #
  # Created by Gabriele Kahlout on 27.03.11.
- #
+ #
- # The following script crawls the whole-web incrementally; specifying a list of urls to crawl, nutch will continuously fetch $it_size urls from a
+ # The following script crawls the whole-web incrementally; specifying a list of urls to crawl, nutch will continuously fetch $it_size urls from a
  # specified list of urls, index and merge them with our whole-web index, so that they can be immediately searched, until all urls have been fetched.
  #
  # Usage: ./whole-web-crawling-incremental [it_seedsDir-path urls-to-fetch-per-iteration depth]
@@ -23, +28 @@
  # 2. $ cd $NUTCH_HOME
  # 3. $ chmod +x whole-web-crawling-incremental
  # 4. $ ./whole-web-crawling-incremental
- #
+ #
  # Start

  function echoThenRun () { # echo and then run the command
      echo $1
@@ -69, +74 @@
  do
      echo
      echo "generate-fetch-updatedb-invertlinks-index-merge iteration "$i":"
+     echo
- echoThenRun "bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
+ cmd="bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
- output=`$cmd`
- echo $output
+ echo $cmd
+ output=`$cmd`
+ echo $output
  if [[ $output == *'0 records selected for fetching'* ]] # all the urls of this iteration have been fetched
  then
      break;
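The diff only shows the opening lines of the `echoThenRun` helper (the function header and the first `echo`). As a rough sketch of how such a helper could look in full — the capture-and-print body below is an assumption, not taken from the page:

```shell
#!/bin/sh
# echoThenRun: print a command line, run it, then print its output.
# Sketch only: the wiki diff shows just the function header and the
# first echo; capturing the output into $output is assumed here.
echoThenRun () {
    echo "$1"
    output=`$1`
    echo "$output"
}

echoThenRun "echo hello"
```

Printing the command before running it is what lets the tutorial's loop both log each nutch invocation and inspect its output afterwards.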

