Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "Tutorial on incremental crawling" page has been changed by Gabriele Kahlout.
http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling?action=diff&rev1=1&rev2=2

--------------------------------------------------

  If not ready, follow [[Tutorial]] to set up and configure Nutch on your machine.
- It also works with Solr. If you have Solr setup
+ Follow the 2 scripts:
- {{{
+ 1. Abridged script using Solr;
+
+ 2. Unabridged script with explanations and using nutch index.
+
+ == 1. Abridged script using Solr ==
+ == 2. Unabridged script with explanations and using nutch index ==
- #!/bin/sh
+ {{{#!/bin/sh
  #
  # Created by Gabriele Kahlout on 27.03.11.
- #
+ #
- # The following script crawls the whole-web incrementally; specifying a list of urls to crawl, nutch will continuously fetch $it_size urls from a
+ # The following script crawls the whole-web incrementally; specifying a list of urls to crawl, nutch will continuously fetch $it_size urls from a
  # specified list of urls, index and merge them with our whole-web index, so that they can be immediately searched, until all urls have been fetched.
  #
  # Usage: ./whole-web-crawling-incremental [it_seedsDir-path urls-to-fetch-per-iteration depth]
@@ -23, +28 @@
  # 2. $ cd $NUTCH_HOME
  # 3. $ chmod +x whole-web-crawling-incremental
  # 4. $ ./whole-web-crawling-incremental
- #
+ #
  # Start

  function echoThenRun () { # echo and then run the command
      echo $1
@@ -69, +74 @@
  do
      echo
      echo "generate-fetch-updatedb-invertlinks-index-merge iteration "$i":"
+     echo
- echoThenRun "bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
+ cmd="bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
- output=`$cmd`
- echo $output
+ echo $cmd
+ output=`$cmd`
+ echo $output
  if [[ $output == *'0 records selected for fetching'* ]] # all the urls of this iteration have been fetched
  then
      break;
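The diff only shows the opening lines of the `echoThenRun` helper (the function header and the first `echo`). As a rough sketch of how such a helper could look in full — the capture-and-print body below is an assumption, not taken from the page:

```shell
#!/bin/sh
# echoThenRun: print a command line, run it, then print its output.
# Sketch only: the wiki diff shows just the function header and the
# first echo; capturing the output into $output is assumed here.
echoThenRun () {
    echo "$1"
    output=`$1`
    echo "$output"
}

echoThenRun "echo hello"
```

Printing the command before running it is what lets the tutorial's loop both log each nutch invocation and inspect its output afterwards.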

