Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by tyrellperera:
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine

------------------------------------------------------------------------------
  The Nutch search engine consists, very roughly, of three components:
   1. The Crawler, which discovers and retrieves web pages
-
-  2. The “WebDB”, a custom database that stores known URLs and fetched page contents
+  [[BR]]2. The “WebDB”, a custom database that stores known URLs and fetched page contents
-
-  3. The “Indexer”, which dissects pages and builds keyword-based indexes from them
+  [[BR]]3. The “Indexer”, which dissects pages and builds keyword-based indexes from them
- After the initial creation of an Index, it is usual to perform periodic updates of the index, in order to keep it up-to-date. We will look into the details of index maintenance in the parts following this.
+ (!) After the initial creation of an Index, it is usual to perform periodic updates of the index, in order to keep it up-to-date. We will look into the details of index maintenance in the parts following this.
  == 2.2 The Nutch Web Application ==
@@ -61, +59 @@
  All components listed above use the nutch API. The users can utilize the API via two approaches, which depends on the task at hand.
   1. Through the nutch Shell Script for administrative tasks, such as creating and maintaining indexes
-  2. Through the Search Web Application, in order to perform a search using keywords
+  [[BR]]2. Through the Search Web Application, in order to perform a search using keywords
  = 3 Implementing a Nutch Search =
@@ -69, +67 @@
  Implementing our own version of Nutch is fairly easy, provided that you;
   1. have a basic understanding of how a web search engine works and
-  2. are comfortable working in a command line and finally
+  [[BR]]2. are comfortable working in a command line and finally
-  3. have a fair knowledge of Java and Servlet containers
+  [[BR]]3. have a fair knowledge of Java and Servlet containers
  If you said “yes” to all three questions above, you have a very high probability of having your Nutch implementation up and running by the end of the steps which follows.
@@ -81, +79 @@
  Go to http://www.apache.org/dyn/closer.cgi/lucene/nutch/ and select a mirror to download Nutch. The version described in this document is version 0.7. After downloading the archive, extract it to your disk.
- NOTE: This document assumes that the archive was extracted to /home/tyrell/nutch-0.7 change this path to reflect your location.
+ /!\ NOTE: This document assumes that the archive was extracted to /home/tyrell/nutch-0.7 change this path to reflect your location.
  === 3.1.2 Download and Install a Servlet Container ===
@@ -125, +123 @@
   * -threads threads determines the number of threads that will fetch in parallel.
  }}}
- For example, a typical command might be:
+ For example, a typical command might be:
  {{{
  bin/nutch crawl urls -dir crawl.virtusa -depth 10
  }}}
@@ -225, +223 @@
  }}}
- === 3.5.2 Scheduling Index Updation ===
+ === 3.5.2 Scheduling Index Updations ===
  The above shell script can be scheduled to be run periodically using a “cron” job.
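
As a rough illustration of the cron scheduling mentioned in the last line, a crontab entry might look like the sketch below. The script name update-index.sh, the log file location and the 02:00 daily schedule are all assumptions made for this example; only the /home/tyrell/nutch-0.7 directory comes from the page itself.

{{{
# Hypothetical crontab entry (edit with: crontab -e).
# Fields: minute hour day-of-month month day-of-week command
# Runs the index update script every day at 02:00 and appends its
# output to a log file. Script name and log path are placeholders.
0 2 * * * /home/tyrell/nutch-0.7/update-index.sh >> /home/tyrell/nutch-0.7/update-index.log 2>&1
}}}

Redirecting both standard output and standard error to a file is one way to keep a record of each run, since cron jobs have no terminal to print to.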
