Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "Tutorial on incremental crawling" page has been changed by Gabriele Kahlout. The comment on this change is: .

http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling?action=diff&rev1=3&rev2=4

--------------------------------------------------

Two scripts follow:

 1. Abridged script using Solr;
 2. Unabridged script with explanations, using nutch index.

== 1. Abridged script using Solr ==

{{{
#!/bin/sh
#
# Created by Gabriele Kahlout on 27.03.11.
#
# This script crawls the whole web incrementally: given a list of urls to
# crawl, nutch continuously fetches $it_size urls from the list, indexes
# them and merges them with the whole-web index, so that they can be
# searched immediately, until all urls have been fetched.
#
# TO USE:
# 1. $ mv whole-web-crawling-incremental $NUTCH_HOME/whole-web-crawling-incremental
# 2. $ cd $NUTCH_HOME
# 3. $ chmod +x whole-web-crawling-incremental
# 4. $ ./whole-web-crawling-incremental
#
# Usage: ./whole-web-crawling-incremental [seedsDir-path urls-to-fetch-per-iteration depth]

rm -rf crawl                # start from a fresh crawl

seedsDir=$1
it_size=$2
depth=$3

indexedPlus1=1              # urls already indexed, +1 because of tail; never printed out
it_seedsDir="$seedsDir/it_seeds"
rm -rf $it_seedsDir
mkdir $it_seedsDir

allUrls=`cat $seedsDir/*url* | wc -l | sed -e "s/^ *//"`
echo "$allUrls urls to crawl"

it_crawldb="crawl/crawldb"

while [[ $indexedPlus1 -le $allUrls ]]
do
    # select the next $it_size unfetched urls from the seed list
    rm -f $it_seedsDir/urls
    tail -n+$indexedPlus1 $seedsDir/*url* | head -n$it_size > $it_seedsDir/urls

    bin/nutch inject $it_crawldb $it_seedsDir
    i=0

    while [[ $i -lt $depth ]]
    do
        # run generate once and capture its output (running it a second
        # time would create a spurious extra segment)
        output=`bin/nutch generate $it_crawldb crawl/segments -topN $it_size`
        if [[ $output == *'0 records selected for fetching'* ]]
        then
            break
        fi
        s1=`ls -d crawl/segments/2* | tail -1`    # newest segment

        bin/nutch fetch $s1
        bin/nutch updatedb $it_crawldb $s1
        bin/nutch invertlinks crawl/linkdb -dir crawl/segments
        bin/nutch solrindex http://localhost:8080/solr/ $it_crawldb crawl/linkdb crawl/segments/*

        ((i++))
        ((indexedPlus1+=$it_size))
    done
done
rm -r $it_seedsDir
}}}

== 2. Unabridged script with explanations and using nutch index ==

Only the changed hunks of the unabridged script appear in this diff:

{{{
#!/bin/sh
#
}}}

@@ -87, +156 @@

{{{
        echo
        cmd="bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
        echo $cmd
        output=`$cmd`    # run generate once and capture its output
        echo $output
        if [[ $output == *'0 records selected for fetching'* ]]    # all the urls of this iteration have been fetched
}}}

@@ -149, +219 @@

{{{
rm -r $crawl_dump $it_seedsDir
echoThenRun "bin/nutch readdb $allcrawldb -dump $crawl_dump"    # you can inspect the dump with $ vim $crawl_dump
bin/nutch readdb $allcrawldb -stats
}}}
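The heart of the incremental loop is the tail/head window that slices the seed list into $it_size-url batches. A standalone sketch of just that slicing, with no nutch involved; the seeds.txt file name and its urls are made up for illustration:

```shell
#!/bin/sh
# Sketch of the script's batching: print successive $it_size-url slices of a
# seed list with tail/head. seeds.txt and its urls are hypothetical.
printf 'http://a/\nhttp://b/\nhttp://c/\nhttp://d/\nhttp://e/\n' > seeds.txt

it_size=2
indexedPlus1=1                       # 1-based index of the next unfetched url
allUrls=`wc -l < seeds.txt`

while [ $indexedPlus1 -le $allUrls ]
do
    echo "batch starting at url $indexedPlus1:"
    tail -n+$indexedPlus1 seeds.txt | head -n$it_size
    indexedPlus1=$((indexedPlus1 + it_size))
done

rm seeds.txt
```

With five urls and it_size=2 this prints batches of 2, 2 and 1 urls, mirroring how the crawl script feeds $it_seedsDir/urls to bin/nutch inject.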
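The inner loop stops early once generate reports that nothing is left to fetch; the script detects this with bash's `[[ string == *glob* ]]` substring match on the captured console output. A minimal sketch of that check (the sample $output string is illustrative, not verbatim nutch output):

```shell
#!/bin/bash
# Sketch of the loop-termination test: [[ ... == *glob* ]] does a substring
# match (bash syntax, not POSIX sh). The sample output string is made up.
output='Generator: 0 records selected for fetching, exiting ...'

if [[ $output == *'0 records selected for fetching'* ]]
then
    echo "nothing left to fetch; stop this iteration"
fi
```

Note that `[[ ... ]]` and `(( ... ))` are bashisms, so the crawl script should be run with bash on systems where /bin/sh is a stricter POSIX shell.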
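After each generate, the script needs the path of the segment just created. Segment directories are named by timestamp, so `ls -d crawl/segments/2* | tail -1` picks the lexicographically, and therefore chronologically, last one. A sketch of that selection using made-up directory names:

```shell
#!/bin/sh
# Sketch of newest-segment selection: segment dirs are timestamped, so the
# lexicographically last ls entry is the most recent. Names are hypothetical.
mkdir -p demo_segments/20110327120000 demo_segments/20110327120500

s1=`ls -d demo_segments/2* | tail -1`
echo "newest segment: $s1"

rm -rf demo_segments
```

This prints `newest segment: demo_segments/20110327120500`, the later of the two timestamps.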

