Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by RichardBraman:
http://wiki.apache.org/nutch/NutchTutorial

------------------------------------------------------------------------------
   * a ''crawl_parse'' contains the outlink urls, used to update the crawldb
   1. The indexes are Lucene-format indexes.

- === Step-by-Step: Boostrapping the Web Database ===
+ === Step-by-Step: Seeding the CrawlDB with a list of URLs ===
+ Option 1: Bootstrapping from the DMOZ database

  The injector adds urls to the crawldb. Let's inject URLs from the DMOZ Open Directory. First we must download and uncompress the file listing all of the DMOZ pages. (This is a 200+Mb file, so this will take a few minutes.)

  {{{ wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
@@ -93, +94 @@
  {{{ bin/nutch inject crawl/crawldb dmoz }}}

  Now we have a web database with around 1000 as-yet unfetched URLs in it.
+
+ Option 2: Instead of bootstrapping from DMOZ, we can create a text file called urls containing one URL per line, then initialize the crawldb with the selected URLs.
+
+ {{{ bin/nutch inject crawl/crawldb urls }}}
+
  === Step-by-Step: Fetching ===
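
For Option 2 in the change above, a minimal sketch of preparing the seed list might look like the following. The two example.com/example.org URLs are placeholders and not part of the tutorial, and the format check is only an illustrative sanity test, not something Nutch requires. The inject command is the one from the page and is run only if bin/nutch is present in the current directory.

```shell
#!/bin/sh
# Sketch only: the URLs below are placeholders (assumption), not from the tutorial.
set -e

# Write the seed list as plain text, one URL per line.
cat > urls <<'EOF'
http://example.com/
http://example.org/
EOF

# Count lines that do not look like a single http(s) URL.
# grep -c exits non-zero when the count is 0, hence the "|| true".
bad=$(grep -vcE '^https?://[^[:space:]]+$' urls || true)
echo "seed URLs: $(wc -l < urls | tr -d ' '), malformed lines: $bad"

# Inject into the crawldb only when run from a Nutch installation directory.
if [ -x bin/nutch ]; then
  bin/nutch inject crawl/crawldb urls
fi
```

Keeping the seed file to one URL per line, as the page describes, makes it easy to check or extend with ordinary text tools before injecting.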
