Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by DanielLaLiberte:
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine

------------------------------------------------------------------------------

  === 3.2.1 Create a flat file of root urls. ===
- For example, to crawl the http://www.virtusa.com site from scratch, you might start with a file named 'urls' containing just the Virtusa home page. All other pages should be reachable from this page. The 'urls' file would therefore contain: http://www.virtusa.com
+ For example, to crawl the http://www.virtusa.com site from scratch, you might start with a file named 'urls' containing just the URL for the Virtusa home page. All other pages should be reachable by links from this page. The 'urls' file would therefore contain: http://www.virtusa.com
+ 
+ The 'urls' file could be put anywhere. It will be used below in the nutch crawl command, which assumes the file is in the nutch directory.

  === 3.2.2 Edit the file conf/crawl-urlfilter.txt ===
  If you are using TRUNK then there is no file called conf/crawl-urlfilter.txt, only conf/crawl-urlfilter.txt.template. Just do
  {{{
- cat conf/crawl-urlfilter.txt.template|sed 's/MY.DOMAIN.NAME/criaturitas.org/'g> conf/crawl-urlfilter.txt
+ cat conf/crawl-urlfilter.txt.template | sed 's/MY.DOMAIN.NAME/criaturitas.org/g' > conf/crawl-urlfilter.txt }}}
- }}}

  If you already have this file then replace the existing domain name with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the virtusa.com domain, the line should read:
+ {{{
- {{{ +^http://([a-z0-9]*\.)*virtusa.com/ }}}
+ +^http://([a-z0-9]*\.)*virtusa.com/ }}}

  This will include any url in the domain virtusa.com in the crawl.

@@ -127, +129 @@

  * -dir dir names the directory to put the crawl in.
  * -depth depth indicates the link depth from the root page that should be crawled.
  * -delay delay determines the number of seconds between accesses to each host.
- * -threads threads determines the number of threads that will fetch in parallel.
+ * -threads threads determines the number of threads that will fetch in parallel. }}}
- }}}

  For example, a typical command might be:
+ {{{
- {{{ bin/nutch crawl urls -dir crawl.virtusa -depth 10 }}}
+ bin/nutch crawl urls -dir crawl.virtusa -depth 10 }}}

  === 3.2.4 Output of the crawl ===

@@ -175, +177 @@

  <property>
  <name>searcher.dir</name>
  <value>/home/tyrell/nutch-0.7/crawl.virtusa</value>
- </property>
+ </property> }}}
- }}}

  4. Re-start Tomcat

@@ -196, +197 @@

  Now that all is working, we need to think about the long term maintenance of the index. This is a required activity because the web gets updated frequently. New content will appear on sites while existing content might get modified or deleted altogether.

- Nutch provides the administrator with a set of commands to update a given index, however performing them manually will not only be tiresome but also non productive. Since this task need to be carried out periodically it should ideally be scheduled.
+ Nutch provides the administrator with a set of commands to update a given index; however, performing them manually would be not only tiresome but also unproductive. Since this task needs to be carried out periodically, it should ideally be scheduled.

  === 3.5.1 Creating a Maintenance Shell Script ===

@@ -233, +234 @@

  }}}

- === 3.5.2 Scheduling Index Updations ===
+ === 3.5.2 Scheduling Index Updates ===
  The above shell script can be scheduled to be run periodically using a 'cron' job.
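The maintenance script referenced in sections 3.5.1 and 3.5.2 is not shown in this hunk. As a rough illustration of the idea only, a minimal script along these lines could re-run the crawl command the page already demonstrates and swap the result in; the paths, the `crawl.new`/`crawl.old` directory names, and the crawl depth are all assumptions, not part of the page:

```shell
#!/bin/sh
# Hypothetical maintenance script -- paths and names are illustrative only.
# Re-runs the crawl into a fresh directory, then swaps it into the location
# that searcher.dir in nutch-site.xml points at, so Tomcat always serves a
# complete index.

NUTCH_HOME=/home/tyrell/nutch-0.7        # assumed install directory
CRAWL_DIR="$NUTCH_HOME/crawl.virtusa"    # directory named in searcher.dir

cd "$NUTCH_HOME" || exit 1

# Crawl into a temporary directory first, so a failed crawl does not
# destroy the index currently being served.
rm -rf crawl.new
bin/nutch crawl urls -dir crawl.new -depth 10 || exit 1

# Keep the previous crawl as a backup, then swap the new one in.
rm -rf crawl.old
[ -d "$CRAWL_DIR" ] && mv "$CRAWL_DIR" crawl.old
mv crawl.new "$CRAWL_DIR"
```

Saved as, say, /home/tyrell/bin/update-index.sh and made executable, it could then be scheduled with a crontab entry such as `0 3 * * 0 /home/tyrell/bin/update-index.sh` (3 a.m. every Sunday); after each run Tomcat would be restarted or the webapp reloaded so the searcher picks up the new index, as the page describes.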