[Nutch Wiki] Update of "Nutch - The Java Search Engine" by tyrellperera

Apache Wiki Thu, 23 Mar 2006 21:05:47 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The following page has been changed by tyrellperera:
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine

------------------------------------------------------------------------------
  The Nutch search engine consists, very roughly, of three components:
  
  1. The Crawler, which discovers and retrieves web pages
- 
- 2. The “WebDB”, a custom database that stores known URLs and fetched page contents
+ [[BR]]2. The “WebDB”, a custom database that stores known URLs and fetched page contents
- 
- 3. The “Indexer”, which dissects pages and builds keyword-based indexes from them
+ [[BR]]3. The “Indexer”, which dissects pages and builds keyword-based indexes from them
  
- After the initial creation of an Index, it is usual to perform periodic updates of the index, in order to keep it up-to-date. We will look into the details of index maintenance in the parts following this.
+ (!) After the initial creation of an Index, it is usual to perform periodic updates of the index, in order to keep it up-to-date. We will look into the details of index maintenance in the parts following this.
  
  == 2.2 The Nutch Web Application ==
  
@@ -61, +59 @@

  All components listed above use the Nutch API. Users can access the API 
in one of two ways, depending on the task at hand.
  
  1. Through the nutch Shell Script for administrative tasks, such as creating 
and maintaining indexes
- 2. Through the Search Web Application, in order to perform a search using keywords
+ [[BR]]2. Through the Search Web Application, in order to perform a search using keywords
  
  
  = 3 Implementing a Nutch Search =
@@ -69, +67 @@

  Implementing your own version of Nutch is fairly easy, provided that you:
  
  1. have a basic understanding of how a web search engine works and 
- 2. are comfortable working in a command line and finally
+ [[BR]]2. are comfortable working in a command line and finally
- 3. have a fair knowledge of Java and Servlet containers
+ [[BR]]3. have a fair knowledge of Java and Servlet containers
  
  If you answered “yes” to all three points above, you will very likely 
have your Nutch implementation up and running by the end of the steps that 
follow. 
  
@@ -81, +79 @@

  
  Go to http://www.apache.org/dyn/closer.cgi/lucene/nutch/ and select a mirror 
to download Nutch. The version described in this document is version 0.7. After 
downloading the archive, extract it to your disk. 
  
- NOTE: This document assumes that the archive was extracted to /home/tyrell/nutch-0.7; change this path to reflect your location.
+ /!\ NOTE: This document assumes that the archive was extracted to /home/tyrell/nutch-0.7; change this path to reflect your location.
  
  === 3.1.2 Download and Install a Servlet Container ===
  
@@ -125, +123 @@

   * -threads threads determines the number of threads that will fetch in 
parallel.
   }}}
  
-       For example, a typical command might be:
+ For example, a typical command might be:
  
        {{{ bin/nutch crawl urls -dir crawl.virtusa -depth 10 }}}
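
As a rough sketch, the options described above can be combined into a single invocation. The small function below only assembles and prints the command line so that it can be inspected before it is run; the function name `build_crawl_cmd` and the thread count of 4 are assumptions made for illustration and do not come from this page.

```shell
# Hypothetical helper: prints a crawl command without running it.
# Arguments: URL file, output directory, crawl depth, fetcher threads.
build_crawl_cmd() {
    printf 'bin/nutch crawl %s -dir %s -depth %s -threads %s\n' "$1" "$2" "$3" "$4"
}

# Same values as the example above, plus an assumed thread count of 4.
build_crawl_cmd urls crawl.virtusa 10 4
```

Printing the command first makes it easy to check the directory name and depth before starting a long crawl.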
  
@@ -225, +223 @@

    }}}
  
  
- === 3.5.2 Scheduling Index Updation ===
+ === 3.5.2 Scheduling Index Updations ===
  
  The above shell script can be scheduled to be run periodically using a 
“cron” job.
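
A minimal sketch of such a cron entry follows. The script path, the log file and the 02:00 daily schedule are all assumptions chosen for illustration; substitute the location where the update script was actually saved and a schedule that suits how often the index should be refreshed.

```shell
# Hypothetical path: replace with wherever the update script was saved.
UPDATE_SCRIPT=/home/tyrell/nutch-0.7/update-index.sh

# Crontab fields: minute hour day-of-month month day-of-week command.
# "0 2 * * *" means 02:00 every day; output is appended to an assumed log file.
CRON_LINE="0 2 * * * $UPDATE_SCRIPT >> /tmp/nutch-update.log 2>&1"

# Print the entry. To install it, add this line to the crontab
# opened by `crontab -e`.
echo "$CRON_LINE"
```

Redirecting both standard output and standard error to a log file keeps a record of each scheduled run, which helps when an update fails unattended.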
