Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "Tutorial on incremental crawling" page has been changed by Gabriele Kahlout. The comment on this change is: .

http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling?action=diff&rev1=3&rev2=4

--------------------------------------------------

Two scripts follow:

 1. Abridged script using Solr;
 2. Unabridged script with explanations, using nutch index.

== 1. Abridged script using Solr ==

{{{
#!/bin/sh
#
# Created by Gabriele Kahlout on 27.03.11.
#
# This script crawls the whole web incrementally: given a list of urls to
# crawl, nutch continuously fetches $it_size urls from the list, indexes
# them and merges them with the whole-web index, so that they can be
# searched immediately, until all urls have been fetched.
#
# TO USE:
# 1. $ mv whole-web-crawling-incremental $NUTCH_HOME/whole-web-crawling-incremental
# 2. $ cd $NUTCH_HOME
# 3. $ chmod +x whole-web-crawling-incremental
# 4. $ ./whole-web-crawling-incremental
#
# Usage: ./whole-web-crawling-incremental [seedsDir-path urls-to-fetch-per-iteration depth]

rm -rf crawl                # start from a fresh crawl

seedsDir=$1
it_size=$2
depth=$3

indexedPlus1=1              # urls already indexed, +1 because of tail; never printed out
it_seedsDir="$seedsDir/it_seeds"
rm -rf $it_seedsDir
mkdir $it_seedsDir

allUrls=`cat $seedsDir/*url* | wc -l | sed -e "s/^ *//"`
echo "$allUrls urls to crawl"

it_crawldb="crawl/crawldb"

while [[ $indexedPlus1 -le $allUrls ]]
do
    # select the next $it_size unfetched urls from the seed list
    rm -f $it_seedsDir/urls
    tail -n+$indexedPlus1 $seedsDir/*url* | head -n$it_size > $it_seedsDir/urls

    bin/nutch inject $it_crawldb $it_seedsDir
    i=0

    while [[ $i -lt $depth ]]
    do
        # run generate once and capture its output (running it a second
        # time would create a spurious extra segment)
        output=`bin/nutch generate $it_crawldb crawl/segments -topN $it_size`
        if [[ $output == *'0 records selected for fetching'* ]]
        then
            break
        fi
        s1=`ls -d crawl/segments/2* | tail -1`    # newest segment

        bin/nutch fetch $s1
        bin/nutch updatedb $it_crawldb $s1
        bin/nutch invertlinks crawl/linkdb -dir crawl/segments
        bin/nutch solrindex http://localhost:8080/solr/ $it_crawldb crawl/linkdb crawl/segments/*

        ((i++))
        ((indexedPlus1+=$it_size))
    done
done
rm -r $it_seedsDir
}}}

== 2. Unabridged script with explanations and using nutch index ==

Only the changed hunks of the unabridged script appear in this diff:

{{{
#!/bin/sh
#
}}}

@@ -87, +156 @@

{{{
        echo
        cmd="bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
        echo $cmd
        output=`$cmd`    # run generate once and capture its output
        echo $output
        if [[ $output == *'0 records selected for fetching'* ]]    # all the urls of this iteration have been fetched
}}}

@@ -149, +219 @@

{{{
rm -r $crawl_dump $it_seedsDir
echoThenRun "bin/nutch readdb $allcrawldb -dump $crawl_dump"    # you can inspect the dump with $ vim $crawl_dump
bin/nutch readdb $allcrawldb -stats
}}}
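The heart of the incremental loop is the tail/head window that slices the seed list into $it_size-url batches. A standalone sketch of just that slicing, with no nutch involved; the seeds.txt file name and its urls are made up for illustration:

```shell
#!/bin/sh
# Sketch of the script's batching: print successive $it_size-url slices of a
# seed list with tail/head. seeds.txt and its urls are hypothetical.
printf 'http://a/\nhttp://b/\nhttp://c/\nhttp://d/\nhttp://e/\n' > seeds.txt

it_size=2
indexedPlus1=1                       # 1-based index of the next unfetched url
allUrls=`wc -l < seeds.txt`

while [ $indexedPlus1 -le $allUrls ]
do
    echo "batch starting at url $indexedPlus1:"
    tail -n+$indexedPlus1 seeds.txt | head -n$it_size
    indexedPlus1=$((indexedPlus1 + it_size))
done

rm seeds.txt
```

With five urls and it_size=2 this prints batches of 2, 2 and 1 urls, mirroring how the crawl script feeds $it_seedsDir/urls to bin/nutch inject.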
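The inner loop stops early once generate reports that nothing is left to fetch; the script detects this with bash's `[[ string == *glob* ]]` substring match on the captured console output. A minimal sketch of that check (the sample $output string is illustrative, not verbatim nutch output):

```shell
#!/bin/bash
# Sketch of the loop-termination test: [[ ... == *glob* ]] does a substring
# match (bash syntax, not POSIX sh). The sample output string is made up.
output='Generator: 0 records selected for fetching, exiting ...'

if [[ $output == *'0 records selected for fetching'* ]]
then
    echo "nothing left to fetch; stop this iteration"
fi
```

Note that `[[ ... ]]` and `(( ... ))` are bashisms, so the crawl script should be run with bash on systems where /bin/sh is a stricter POSIX shell.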
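After each generate, the script needs the path of the segment just created. Segment directories are named by timestamp, so `ls -d crawl/segments/2* | tail -1` picks the lexicographically, and therefore chronologically, last one. A sketch of that selection using made-up directory names:

```shell
#!/bin/sh
# Sketch of newest-segment selection: segment dirs are timestamped, so the
# lexicographically last ls entry is the most recent. Names are hypothetical.
mkdir -p demo_segments/20110327120000 demo_segments/20110327120500

s1=`ls -d demo_segments/2* | tail -1`
echo "newest segment: $s1"

rm -rf demo_segments
```

This prints `newest segment: demo_segments/20110327120500`, the later of the two timestamps.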

