The "Tutorial on incremental crawling" page has been changed by Gabriele Kahlout.
http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling?action=diff&rev1=7&rev2=8

--------------------------------------------------

  #
  # Created by Gabriele Kahlout on 27.03.11.
  # The following script crawls the whole-web incrementally; Specifying a list of urls to crawl, nutch will continuously fetch $it_size urls from a specified list of urls, index and merge them with our whole-web index, so that they can be immediately searched, until all urls have been fetched.
+ # It assumes that you have setup Solr and it's running on port 8080.
  #
  # TO USE:
  # 1. $ mv whole-web-crawling-incremental $NUTCH_HOME/whole-web-crawling-incremental
  # 2. $ cd $NUTCH_HOME
  # 3. $ chmod +x whole-web-crawling-incremental
- # 4. $ ./whole-web-crawling-incremental seeds 5 2
+ # 4. $ ./whole-web-crawling-incremental
- # Usage: ./whole-web-crawling-incremental it_seedsDir-path urls-to-fetch-per-iteration depth
+ # Usage: ./whole-web-crawling-incremental [it_seedsDir-path urls-to-fetch-per-iteration depth]
  # Start
- rm -r crawl # fresh crawl
+ rm -r crawl
  seedsDir=$1
  it_size=$2
@@ -41, +42 @@
  mkdir $it_seedsDir
  allUrls=`cat $seedsDir/*url* | wc -l | sed -e "s/^ *//"`
- echo $allUrls" urls to crawl"
  it_crawldb="crawl/crawldb"
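
The batching that the script's header comment describes (take $it_size urls at a time from the seed list until none are left) might look roughly like the sketch below. The helper names batch_urls and batch_count, the temporary demo seed file, and the batch size of 3 are illustrative assumptions and do not come from the wiki script; the Nutch fetch and Solr indexing steps are deliberately left out.

```shell
#!/bin/sh
# Sketch only: split a seed list into batches of $it_size urls.
# batch_urls and batch_count are hypothetical helpers, not part of Nutch.

# Print the urls of batch number $2 (0-based), $3 urls per batch, from file $1.
batch_urls() {
    file=$1
    index=$2
    size=$3
    start=$(( index * size + 1 ))
    end=$(( start + size - 1 ))
    sed -n "${start},${end}p" "$file"
}

# Print how many batches are needed for $1 urls at $2 urls per batch
# (integer ceiling division).
batch_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Demo with a throwaway seed file of seven made-up urls.
demo_dir=$(mktemp -d)
printf 'http://example.com/%s\n' 1 2 3 4 5 6 7 > "$demo_dir/urls"

it_size=3
# Same counting idiom as the script: wc -l, then strip leading spaces.
allUrls=$(cat "$demo_dir"/*url* | wc -l | sed -e "s/^ *//")
batches=$(batch_count "$allUrls" "$it_size")

i=0
while [ "$i" -lt "$batches" ]; do
    echo "batch $i:"
    # A real run would hand these urls to the crawl and index steps here.
    batch_urls "$demo_dir/urls" "$i" "$it_size"
    i=$(( i + 1 ))
done

rm -r "$demo_dir"
```

The last batch is simply shorter when the url count is not a multiple of the batch size, which is why the batch count is rounded up.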

