Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by JakeVanderdray:
http://wiki.apache.org/nutch/NutchTutorial

------------------------------------------------------------------------------
  Now we're ready to crawl. There are two approaches to crawling:

- 1. Intranet crawling, with the crawl command.
- 2. Whole-web crawling, with much greater control, using the lower level inject, generate, fetch and updatedb commands.
+ 1. Using the '''crawl''' command to perform all the crawl steps with a single command. This is sometimes referred to as '''Intranet Crawling'''. Although it is a simple way to get started, it has limitations.
+ 2. Using the lower-level inject, generate, fetch and updatedb commands. Sometimes referred to as '''Whole-Web Crawling''', this gives you more control over each step of the process and is required in order to update existing data.

- == Intranet Crawling ==
+ == The Crawl Command ==

- Intranet crawling is more appropriate when you intend to crawl up to around one million pages on a handful of web servers.
+ The crawl command is more appropriate when you intend to crawl up to around one million pages on a handful of web servers.

- === Intranet: Configuration ===
+ === Crawl Command: Configuration ===

- To configure things for intranet crawling you must:
+ To configure things for the crawl command you must:

   * Create a directory with a flat file of root urls. For example, to crawl the nutch site you might start with a file named urls/nutch containing the url of just the Nutch home page. All other Nutch pages should be reachable from this page. The urls/nutch file would thus contain:

@@ -40, +40 @@

  This will include any url in the domain apache.org.

- === Intranet: Running the Crawl ===
+ === Crawl Command: Running the Crawl ===

  Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:

@@ -57, +57 @@

  Once crawling has completed, one can skip to the Searching section below.

- == Whole-web Crawling ==
+ == Step-by-Step or Whole-web Crawling ==

  Whole-web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines.

- === Whole-web: Concepts ===
+ === Step-by-Step: Concepts ===

  Nutch data is composed of:

@@ -76, +76 @@

   * a ''crawl_parse'' contains the outlink urls, used to update the crawldb
  1. The indexes are Lucene-format indexes.

- === Whole-web: Boostrapping the Web Database ===
+ === Step-by-Step: Bootstrapping the Web Database ===

  The injector adds urls to the crawldb. Let's inject URLs from the DMOZ Open Directory. First we must download and uncompress the file listing all of the DMOZ pages. (This is a 200+Mb file, so this will take a few minutes.)

@@ -94, +94 @@

  Now we have a web database with around 1000 as-yet unfetched URLs in it.

- === Whole-web: Fetching ===
+ === Step-by-Step: Fetching ===

  To fetch, we first generate a fetchlist from the database:

@@ -135, +135 @@

  By this point we've fetched a few thousand pages. Let's index them!

- === Whole-web: Indexing ===
+ === Step-by-Step: Indexing ===

  Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages.
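
For reference, a minimal sketch of a crawl-command run against a urls directory like the one described above (the output directory name, depth and topN values here are only illustrative, and the option set varies a little between Nutch versions):

{{{
# one-shot crawl: inject, generate, fetch, updatedb and index in a single command
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
}}}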
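
Similarly, a rough sketch of one step-by-step (whole-web) cycle, assuming the crawl/crawldb, crawl/segments and crawl/linkdb layout used in the tutorial; the dmoz directory stands in for wherever the seed URLs were written, and the exact indexing arguments depend on the Nutch version:

{{{
# inject seed urls into the crawldb
bin/nutch inject crawl/crawldb dmoz
# generate a fetchlist (a new segment) from the crawldb
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
# pick up the newly created segment directory
s1=`ls -d crawl/segments/2* | tail -1`
# fetch the segment
bin/nutch fetch $s1
# update the crawldb with the results of the fetch
bin/nutch updatedb crawl/crawldb $s1
# invert links so incoming anchor text can be indexed with the pages it points to
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
# build Lucene indexes from the crawldb, linkdb and segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
}}}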
