Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by AndrzejBialecki:
http://wiki.apache.org/nutch/NutchTutorial

The comment on the change is:
Some clarifications related to 0.8+

------------------------------------------------------------------------------
  == Requirements ==
- 1. Java 1.4.x, either from Sun or IBM on Linux is preferred. Set NUTCH_JAVA_HOME to the root of your JVM installation.
+ 1. Java 1.4.x, either from Sun or IBM on Linux is preferred. Set NUTCH_JAVA_HOME to the root of your JVM installation. Nutch 0.9 requires Sun JDK 1.5 or higher.
- 1. Apache's Tomcat 4.x.
+ 1. Apache's Tomcat 4.x or higher.
  1. On Win32, cygwin, for shell support. (If you plan to use Subversion on Win32, be sure to select the subversion package when you install, in the "Devel" category.)
  1. Up to a gigabyte of free disk space, a high-speed connection, and an hour or so.

@@ -75, +75 @@
  1. A set of segments. Each segment is a set of urls that are fetched as a unit. Segments are directories with the following subdirectories:
   * a ''crawl_generate'' names a set of urls to be fetched
   * a ''crawl_fetch'' contains the status of fetching each url
-  * a ''content contains'' the content of each url
+  * a ''content'' contains the raw content retrieved from each url
   * a ''parse_text'' contains the parsed text of each url
   * a ''parse_data'' contains outlinks and metadata parsed from each url
   * a ''crawl_parse'' contains the outlink urls, used to update the crawldb

@@ -83, +83 @@
  === Step-by-Step: Seeding the Crawl DB with a list of URLs ===
- Option 1: Bootstraping the DMOZ database
+ ==== Option 1: Bootstrapping from the DMOZ database ====
+
  The injector adds urls to the crawldb. Let's inject URLs from the DMOZ Open Directory. First we must download and uncompress the file listing all of the DMOZ pages. (This is a 200+Mb file, so this will take a few minutes.)
  {{{
  wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz

@@ -100, +101 @@
  Now we have a web database with around 1000 as-yet unfetched URLs in it.

+ ==== Option 2: Bootstrapping from an initial seed list ====
+
- Option 2. Instead of Bootsrapping DMOZ, we can create a text file called urls, this file should have one url per line. We can initialize the crawl db with the selected urls.
+ Instead of bootstrapping from DMOZ, we can create a text file called {{{urls}}}; this file should have one URL per line. We can initialize the crawl db with the selected URLs.
  {{{
  bin/nutch inject crawl/crawldb urls
  }}}
+
+ ''NOTE: version 0.8 and higher requires that we put this file into a subdirectory, e.g. {{{seed/urls}}}; in this case the command looks like this:''
+
+ {{{ bin/nutch inject crawl/crawldb seed }}}

  === Step-by-Step: Fetching ===
@@ -111, +119 @@
  {{{
  bin/nutch generate crawl/crawldb crawl/segments
  }}}
- This generates a fetchlist for all of the pages due to be fetched. The fetchlist is placed in a newly created segment directory. The segment directory is named by the time it's created. We save the name of this segment in the shell variable s1:
+ This generates a fetchlist for all of the pages due to be fetched. The fetchlist is placed in a newly created segment directory. The segment directory is named by the time it's created. We save the name of this segment in the shell variable {{{s1}}}:
  {{{
  s1=`ls -d crawl/segments/2* | tail -1`
  echo $s1
  }}}

@@ -124, +132 @@
  {{{
  bin/nutch updatedb crawl/crawldb $s1
  }}}
- Now the database has entries for all of the pages referenced by the initial set.
+ Now the database contains both updated entries for all initial pages as well as new entries that correspond to newly discovered pages linked from the initial set.
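To verify what the update did, you can print statistics from the crawldb; a minimal check, assuming the {{{readdb}}} tool shipped with Nutch 0.8 and higher:

{{{
# print crawldb statistics: total number of URLs and counts by fetch status
bin/nutch readdb crawl/crawldb -stats
}}}

The total URL count should have grown well beyond the initial seeds, confirming that newly discovered pages were added.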
- Now we fetch a new segment with the top-scoring 1000 pages:
+ Now we generate and fetch a new segment containing the top-scoring 1000 pages:
  {{{
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s2=`ls -d crawl/segments/2* | tail -1`

@@ -168, +177 @@
  After you have verified that the above command returns results you can proceed to setting up the web interface.

- To search you need to put the nutch war file into your servlet container. (If instead of downloading a Nutch release you checked the sources out of SVN, then you'll first need to build the war file, with the command ant war.)
+ To search you need to put the nutch war file into your servlet container. (If instead of downloading a Nutch release you checked the sources out of SVN, then you'll first need to build the war file, with the command {{{ant war}}}.)

  Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war file may be installed with the commands:
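A minimal sketch of that installation, assuming the default Tomcat webapps layout and that the war file in your Nutch distribution is named {{{nutch-0.8.war}}} (the exact name depends on your release); it replaces Tomcat's default ROOT application:

{{{
# remove the default ROOT webapp so Nutch can take its place
rm -rf ~/local/tomcat/webapps/ROOT*
# the war file name here is an assumption; use the one shipped with your release
cp nutch-0.8.war ~/local/tomcat/webapps/ROOT.war
}}}

After restarting Tomcat the search page should be reachable at http://localhost:8080/. The searcher also needs to find your index, so either launch Tomcat from the directory containing your {{{crawl}}} directory or point the {{{searcher.dir}}} property in {{{nutch-site.xml}}} at it.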