Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "RunningNutchAndSolr" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=72&rev2=73

  The parser also takes a few minutes, as it must parse the full file. Finally, we initialize the crawl db with the selected urls.

  {{{
- bin/nutch inject crawl/crawldb dmoz
+ bin/nutch inject crawldb dmoz
  }}}

  Now we have a web database with around 1000 as-yet unfetched URLs in it.

@@ -137, +137 @@

  ===== Option 2. Bootstrapping from an initial seed list. =====
  This option shadows the creation of the seed list as covered [[#3. Crawl your first website|here]].

+ {{{
- {{{ bin/nutch inject crawldb urls }}}
+ bin/nutch inject crawldb urls
+ }}}

  ==== Step-by-Step: Fetching ====
  To fetch, we first generate a fetch list from the database:

+ {{{
- {{{ bin/nutch generate crawldb segments }}}
+ bin/nutch generate crawldb segments
+ }}}

  This generates a fetch list for all of the pages due to be fetched. The fetch list is placed in a newly created segment directory. The segment directory is named by the time it's created. We save the name of this segment in the shell variable {{{s1}}}:

@@ -152, +156 @@

  }}}
  Now we run the fetcher on this segment with:

+ {{{
- {{{ bin/nutch fetch $s1 }}}
+ bin/nutch fetch $s1
+ }}}

  When this is complete, we update the database with the results of the fetch:

+ {{{
- {{{ bin/nutch updatedb crawldb $s1 }}}
+ bin/nutch updatedb crawldb $s1
+ }}}

  Now the database contains both updated entries for all initial pages as well as new entries that correspond to newly discovered pages linked from the initial set.

  Then we parse the entries:

+ {{{
- {{{ bin/nutch parse $1 }}}
+ bin/nutch parse $1
+ }}}

  Now we generate and fetch a new segment containing the top-scoring 1000 pages:

@@ -191, +201 @@

  ==== Step-by-Step: Invertlinks ====
  Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages.

+ {{{
- {{{ bin/nutch invertlinks linkdb -dir segments }}}
+ bin/nutch invertlinks linkdb -dir segments
+ }}}

  We are now ready to search with Apache Solr.
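
The generate, fetch, updatedb and parse steps in the changed page can be chained into one round. The following is a minimal sketch, not part of the wiki page: the names `NUTCH`, `latest_segment` and `crawl_round` are hypothetical, and it assumes that segment directory names begin with the year (so `2*` matches them) and that the parse step is meant to receive the same segment path as the fetch step. The steps run in the order the page lists them.

```shell
#!/bin/sh
# Sketch of one crawl round using the commands shown in the page.
# NUTCH, latest_segment and crawl_round are hypothetical helper names,
# not part of Nutch itself.
NUTCH="${NUTCH:-bin/nutch}"

# Print the newest segment directory under $1. Segment names are
# timestamps, so the lexicographically last entry is the most recent.
# Assumption: names start with the year, hence the "2*" glob.
latest_segment() {
    ls -d "$1"/2* | tail -1
}

# Usage: crawl_round <crawldb> <segments-dir>
# Leaves the segment path in the shell variable s1, as the page does.
crawl_round() {
    crawldb=$1
    segments=$2

    # Generate a fetch list; it lands in a newly created segment directory.
    "$NUTCH" generate "$crawldb" "$segments" || return 1

    # Remember the segment that was just created.
    s1=$(latest_segment "$segments")
    [ -n "$s1" ] || return 1

    # Fetch the segment, fold the results into the crawl db, then parse.
    "$NUTCH" fetch "$s1" || return 1
    "$NUTCH" updatedb "$crawldb" "$s1" || return 1
    "$NUTCH" parse "$s1" || return 1
}
```

With this loaded, a run following the page would be `bin/nutch inject crawldb urls`, then `crawl_round crawldb segments` once per round, and finally `bin/nutch invertlinks linkdb -dir segments`. Stopping at the first failing step keeps a broken fetch from being written into the crawl db.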

