[Nutch Wiki] Update of "Tutorial on incremental crawling" by Gabriele Kahlout

Apache Wiki Sun, 27 Mar 2011 06:00:16 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "Tutorial on incremental crawling" page has been changed by Gabriele 
Kahlout.
http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling?action=diff&rev1=7&rev2=8

--------------------------------------------------

  #
  # Created by Gabriele Kahlout on 27.03.11.
  # The following script crawls the whole web incrementally: given a list of
  # urls to crawl, nutch repeatedly fetches $it_size urls from that list,
  # indexes them and merges them into our whole-web index, so that they can be
  # searched immediately, until all urls have been fetched.
+ # It assumes that you have set up Solr and that it is running on port 8080.
  #
  # TO USE:
  # 1. $ mv whole-web-crawling-incremental $NUTCH_HOME/whole-web-crawling-incremental
  # 2. $ cd $NUTCH_HOME
  # 3. $ chmod +x whole-web-crawling-incremental
- # 4. $ ./whole-web-crawling-incremental seeds 5 2
+ # 4. $ ./whole-web-crawling-incremental
  
- # Usage: ./whole-web-crawling-incremental it_seedsDir-path urls-to-fetch-per-iteration depth
+ # Usage: ./whole-web-crawling-incremental [it_seedsDir-path urls-to-fetch-per-iteration depth]
  # Start
  
- rm -r crawl # fresh crawl
+ rm -r crawl
  
  seedsDir=$1
  it_size=$2
@@ -41, +42 @@

  mkdir $it_seedsDir
  
  allUrls=`cat $seedsDir/*url* | wc -l | sed -e "s/^ *//"`
- echo $allUrls" urls to crawl"
  
  it_crawldb="crawl/crawldb"
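
The batching idea described in the script's header comment (take $it_size urls at a time until the list is exhausted) can be sketched on its own, without running any Nutch or Solr commands. This is only an illustration under stated assumptions: the helper nth_batch, the temporary seed file and the example.* urls are hypothetical and do not appear in the wiki script, and the seed file is assumed to hold one url per line.

```shell
#!/bin/sh
# Sketch only: split a seed list into batches of $it_size urls, one batch per
# iteration. No Nutch or Solr command is invoked here.

# Hypothetical helper (not part of the wiki script):
# print batch number $3 (1-based) of size $2 from file $1.
nth_batch() {
    file=$1; size=$2; n=$3
    start=$(( (n - 1) * size + 1 ))
    end=$(( n * size ))
    sed -n "${start},${end}p" "$file"
}

# Demo seed file (assumed layout: one url per line).
demo_dir=$(mktemp -d)
printf '%s\n' http://a.example/ http://b.example/ http://c.example/ \
    http://d.example/ http://e.example/ > "$demo_dir/urls"

it_size=2
# Same counting idiom as the wiki script; sed strips the padding that some
# wc implementations put in front of the number.
allUrls=$(wc -l < "$demo_dir/urls" | sed -e "s/^ *//")
# Round up so that a final, smaller batch is not dropped.
batches=$(( (allUrls + it_size - 1) / it_size ))

i=1
while [ "$i" -le "$batches" ]; do
    echo "iteration $i:"
    nth_batch "$demo_dir/urls" "$it_size" "$i"
    # In the real script, the fetch/index/merge steps for this batch
    # would go here.
    i=$(( i + 1 ))
done
```

With five urls and it_size=2 the loop runs three times, and the last iteration handles the single remaining url.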
