Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by JakeVanderdray:
http://wiki.apache.org/nutch/NutchTutorial

------------------------------------------------------------------------------
  Now we're ready to crawl. There are two approaches to crawling:

- 1. Intranet crawling, with the crawl command.
- 2. Whole-web crawling, with much greater control, using the lower level inject, generate, fetch and updatedb commands.
+ 1. Using the '''crawl''' command to perform all the crawl steps with a single command. This is sometimes referred to as '''Intranet Crawling'''. Although it is a simple way to get started, it has limitations.
+ 2. Using the lower-level inject, generate, fetch and updatedb commands. Sometimes referred to as '''Whole-Web Crawling''', this gives you more control over each step of the process and is required in order to update existing data.

- == Intranet Crawling ==
+ == The Crawl Command ==

- Intranet crawling is more appropriate when you intend to crawl up to around one million pages on a handful of web servers.
+ The crawl command is more appropriate when you intend to crawl up to around one million pages on a handful of web servers.

- === Intranet: Configuration ===
+ === Crawl Command: Configuration ===

- To configure things for intranet crawling you must:
+ To configure things for the crawl command you must:

   * Create a directory with a flat file of root urls. For example, to crawl the nutch site you might start with a file named urls/nutch containing the url of just the Nutch home page. All other Nutch pages should be reachable from this page. The urls/nutch file would thus contain:

@@ -40, +40 @@

  This will include any url in the domain apache.org.

- === Intranet: Running the Crawl ===
+ === Crawl Command: Running the Crawl ===

  Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:

@@ -57, +57 @@

  Once crawling has completed, one can skip to the Searching section below.

- == Whole-web Crawling ==
+ == Step-by-Step or Whole-web Crawling ==

  Whole-web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines.

- === Whole-web: Concepts ===
+ === Step-by-Step: Concepts ===

  Nutch data is composed of:

@@ -76, +76 @@

   * a ''crawl_parse'' contains the outlink urls, used to update the crawldb
  1. The indexes are Lucene-format indexes.

- === Whole-web: Boostrapping the Web Database ===
+ === Step-by-Step: Bootstrapping the Web Database ===

  The injector adds urls to the crawldb. Let's inject URLs from the DMOZ Open Directory. First we must download and uncompress the file listing all of the DMOZ pages. (This is a 200+Mb file, so this will take a few minutes.)

@@ -94, +94 @@

  Now we have a web database with around 1000 as-yet unfetched URLs in it.

- === Whole-web: Fetching ===
+ === Step-by-Step: Fetching ===

  To fetch, we first generate a fetchlist from the database:

@@ -135, +135 @@

  By this point we've fetched a few thousand pages. Let's index them!

- === Whole-web: Indexing ===
+ === Step-by-Step: Indexing ===

  Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages.
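
For reference, a minimal sketch of a crawl-command run against a urls directory like the one described above (the output directory name, depth and topN values here are only illustrative, and the option set varies a little between Nutch versions):

{{{
# one-shot crawl: inject, generate, fetch, updatedb and index in a single command
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
}}}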
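
Similarly, a rough sketch of one step-by-step (whole-web) cycle, assuming the crawl/crawldb, crawl/segments and crawl/linkdb layout used in the tutorial; the dmoz directory stands in for wherever the seed URLs were written, and the exact indexing arguments depend on the Nutch version:

{{{
# inject seed urls into the crawldb
bin/nutch inject crawl/crawldb dmoz
# generate a fetchlist (a new segment) from the crawldb
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
# pick up the newly created segment directory
s1=`ls -d crawl/segments/2* | tail -1`
# fetch the segment
bin/nutch fetch $s1
# update the crawldb with the results of the fetch
bin/nutch updatedb crawl/crawldb $s1
# invert links so incoming anchor text can be indexed with the pages it points to
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
# build Lucene indexes from the crawldb, linkdb and segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
}}}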
