Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by RichardBraman:
http://wiki.apache.org/nutch/NutchTutorial

------------------------------------------------------------------------------
   * a ''crawl_parse'' contains the outlink urls, used to update the crawldb
   1. The indexes are Lucene-format indexes.

- === Step-by-Step: Seeding the CrawlDB with a list of URLS ===
+ === Step-by-Step: Seeding the Crawl DB with a list of URLS ===

  Option 1: Bootstraping the DMOZ database

  The injector adds urls to the crawldb. Let's inject URLs from the DMOZ Open Directory. First we must download and uncompress the file listing all of the DMOZ pages. (This is a 200+Mb file, so this will take a few minutes.)

@@ -146, +146 @@

  Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages.

  {{{
  bin/nutch invertlinks crawl/linkdb crawl/segments
  }}}
- 
+ NOTE: the invertlinks command only applies to Nutch 0.8 and higher.

  To index the segments we use the index command, as follows:

  {{{
  bin/nutch index indexes crawl/linkdb crawl/segments/*
  }}}
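
The second hunk describes an ordering: links are inverted first, then the segments are indexed. As a rough illustration only, the dry-run sketch below prints the two commands quoted in the diff in that order and does not execute them. The `NUTCH` and `CRAWL_DIR` variables and the `plan_index_steps` function are hypothetical names introduced here; they are not part of Nutch or the tutorial.

```shell
#!/bin/sh
# Dry-run sketch: prints the two commands quoted in the diff, in order.
# NUTCH and CRAWL_DIR are hypothetical variables introduced for illustration.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

plan_index_steps() {
    # Step 1: invert links so incoming anchor text can be indexed with the pages.
    # Per the note in the diff, invertlinks applies to Nutch 0.8 and higher.
    echo "$NUTCH invertlinks $CRAWL_DIR/linkdb $CRAWL_DIR/segments"
    # Step 2: index the segments. The glob is quoted, so it is printed
    # literally here rather than expanded by the shell.
    echo "$NUTCH index indexes $CRAWL_DIR/linkdb $CRAWL_DIR/segments/*"
}

plan_index_steps
```

Printing the commands instead of running them keeps the sketch usable without a Nutch installation; to run the steps for real, the printed lines can be pasted into a shell from the Nutch install directory.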
