Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by AndrzejBialecki:
http://wiki.apache.org/nutch/NutchTutorial

The comment on the change is:
Some clarifications related to 0.8+

------------------------------------------------------------------------------
  == Requirements ==
- 1. Java 1.4.x, either from Sun or IBM on Linux is preferred. Set NUTCH_JAVA_HOME to the root of your JVM installation.
+ 1. Java 1.4.x, either from Sun or IBM on Linux is preferred. Set NUTCH_JAVA_HOME to the root of your JVM installation. Nutch 0.9 requires Sun JDK 1.5 or higher.
- 1. Apache's Tomcat 4.x.
+ 1. Apache's Tomcat 4.x or higher.
  1. On Win32, cygwin, for shell support. (If you plan to use Subversion on Win32, be sure to select the subversion package when you install, in the "Devel" category.)
  1. Up to a gigabyte of free disk space, a high-speed connection, and an hour or so.

@@ -75, +75 @@
  1. A set of segments. Each segment is a set of urls that are fetched as a unit. Segments are directories with the following subdirectories:
   * a ''crawl_generate'' names a set of urls to be fetched
   * a ''crawl_fetch'' contains the status of fetching each url
-  * a ''content contains'' the content of each url
+  * a ''content'' contains the raw content retrieved from each url
   * a ''parse_text'' contains the parsed text of each url
   * a ''parse_data'' contains outlinks and metadata parsed from each url
   * a ''crawl_parse'' contains the outlink urls, used to update the crawldb

@@ -83, +83 @@
  === Step-by-Step: Seeding the Crawl DB with a list of URLs ===
- Option 1: Bootstraping the DMOZ database
+ ==== Option 1: Bootstrapping from the DMOZ database ====
+
  The injector adds urls to the crawldb. Let's inject URLs from the DMOZ Open Directory. First we must download and uncompress the file listing all of the DMOZ pages. (This is a 200+Mb file, so this will take a few minutes.)
  {{{
  wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz

@@ -100, +101 @@
  Now we have a web database with around 1000 as-yet unfetched URLs in it.

+ ==== Option 2: Bootstrapping from an initial seed list ====
+
- Option 2. Instead of Bootsrapping DMOZ, we can create a text file called urls, this file should have one url per line. We can initialize the crawl db with the selected urls.
+ Instead of bootstrapping from DMOZ, we can create a text file called {{{urls}}}; this file should have one URL per line. We can initialize the crawl db with the selected URLs.
  {{{
  bin/nutch inject crawl/crawldb urls
  }}}
+
+ ''NOTE: version 0.8 and higher requires that we put this file into a subdirectory, e.g. {{{seed/urls}}}; in this case the command looks like this:''
+
+ {{{ bin/nutch inject crawl/crawldb seed }}}

  === Step-by-Step: Fetching ===
@@ -111, +119 @@
  {{{
  bin/nutch generate crawl/crawldb crawl/segments
  }}}
- This generates a fetchlist for all of the pages due to be fetched. The fetchlist is placed in a newly created segment directory. The segment directory is named by the time it's created. We save the name of this segment in the shell variable s1:
+ This generates a fetchlist for all of the pages due to be fetched. The fetchlist is placed in a newly created segment directory. The segment directory is named by the time it's created. We save the name of this segment in the shell variable {{{s1}}}:
  {{{
  s1=`ls -d crawl/segments/2* | tail -1`
  echo $s1
  }}}

@@ -124, +132 @@
  {{{
  bin/nutch updatedb crawl/crawldb $s1
  }}}
- Now the database has entries for all of the pages referenced by the initial set.
+ Now the database contains both updated entries for all initial pages as well as new entries that correspond to newly discovered pages linked from the initial set.
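To verify what the update did, you can print statistics from the crawldb; a minimal check, assuming the {{{readdb}}} tool shipped with Nutch 0.8 and higher:

{{{
# print crawldb statistics: total number of URLs and counts by fetch status
bin/nutch readdb crawl/crawldb -stats
}}}

The total URL count should have grown well beyond the initial seeds, confirming that newly discovered pages were added.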
- Now we fetch a new segment with the top-scoring 1000 pages:
+ Now we generate and fetch a new segment containing the top-scoring 1000 pages:
  {{{
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s2=`ls -d crawl/segments/2* | tail -1`

@@ -168, +177 @@
  After you have verified that the above command returns results you can proceed to setting up the web interface.

- To search you need to put the nutch war file into your servlet container. (If instead of downloading a Nutch release you checked the sources out of SVN, then you'll first need to build the war file, with the command ant war.)
+ To search you need to put the nutch war file into your servlet container. (If instead of downloading a Nutch release you checked the sources out of SVN, then you'll first need to build the war file, with the command {{{ant war}}}.)

  Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war file may be installed with the commands:
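A minimal sketch of that installation, assuming the default Tomcat webapps layout and that the war file in your Nutch distribution is named {{{nutch-0.8.war}}} (the exact name depends on your release); it replaces Tomcat's default ROOT application:

{{{
# remove the default ROOT webapp so Nutch can take its place
rm -rf ~/local/tomcat/webapps/ROOT*
# the war file name here is an assumption; use the one shipped with your release
cp nutch-0.8.war ~/local/tomcat/webapps/ROOT.war
}}}

After restarting Tomcat the search page should be reachable at http://localhost:8080/. The searcher also needs to find your index, so either launch Tomcat from the directory containing your {{{crawl}}} directory or point the {{{searcher.dir}}} property in {{{nutch-site.xml}}} at it.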