[Nutch Wiki] Update of "NutchTutorial" by JakeVanderdray

Apache Wiki Tue, 07 Mar 2006 08:45:22 -0800

The following page has been changed by JakeVanderdray:
http://wiki.apache.org/nutch/NutchTutorial

------------------------------------------------------------------------------
  
  Intranet crawling is more appropriate when you intend to crawl up to around one million pages on a handful of web servers.
  
- == Intranet: Configuration ==
+ === Intranet: Configuration ===
  
  To configure things for intranet crawling you must:
  
@@ -40, +40 @@

  
   This will include any URL in the domain apache.org.
  
- == Intranet: Running the Crawl ==
+ === Intranet: Running the Crawl ===
  
  Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:
  
@@ -117, +117 @@

  
  Now we fetch a new segment with the top-scoring 1000 pages:
  
- bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+ {{{ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s2=`ls -d crawl/segments/2* | tail -1`
  echo $s2
  
  bin/nutch fetch $s2
- bin/nutch updatedb crawl/crawldb $s2
+ bin/nutch updatedb crawl/crawldb $s2 }}}
+ 
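The `ls -d crawl/segments/2* | tail -1` step above selects the segment that `generate` just created. Below is a minimal sketch of the same selection as a reusable function. It assumes segment directories have timestamp-style names, so that sorted order matches creation order, and it uses a sorted glob loop instead of `ls`. The name `latest_segment` is hypothetical and not part of Nutch.

```shell
#!/bin/sh
# latest_segment is a hypothetical helper, not a Nutch command.
# Assumption: segment directory names start with "2" and sort chronologically.
latest_segment() {
    segments_dir=$1
    newest=""
    # Pathname expansion is sorted, so the last directory seen is the newest.
    for d in "$segments_dir"/2*; do
        if [ -d "$d" ]; then
            newest=$d
        fi
    done
    if [ -z "$newest" ]; then
        echo "no segments found in $segments_dir" >&2
        return 1
    fi
    printf '%s\n' "$newest"
}
```

It could then replace the backtick line, for example `s2=$(latest_segment crawl/segments)`.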
  Let's fetch one more round:
  
- bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+ {{{ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s3=`ls -d crawl/segments/2* | tail -1`
  echo $s3
  
  bin/nutch fetch $s3
- bin/nutch updatedb crawl/crawldb $s3
+ bin/nutch updatedb crawl/crawldb $s3 }}}
+ 
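The two rounds above repeat the same generate, fetch, updatedb cycle, so they can be folded into a loop. The following is only a sketch under stated assumptions: `crawl_rounds`, its arguments, and the `NUTCH` variable are illustrative names, not Nutch options, and the segment lookup reuses the `ls | tail` idiom from the tutorial.

```shell
#!/bin/sh
# Hypothetical wrapper around the generate/fetch/updatedb commands shown above.
# NUTCH and crawl_rounds are illustrative names, not part of Nutch.
NUTCH=${NUTCH:-bin/nutch}

crawl_rounds() {
    rounds=$1               # number of generate/fetch/updatedb cycles
    topn=$2                 # value passed to -topN
    crawl_dir=${3:-crawl}   # directory holding crawldb and segments

    i=1
    while [ "$i" -le "$rounds" ]; do
        "$NUTCH" generate "$crawl_dir/crawldb" "$crawl_dir/segments" -topN "$topn" || return 1
        # Pick the segment that generate just created (same idiom as the tutorial).
        segment=$(ls -d "$crawl_dir"/segments/2* | tail -1)
        echo "round $i: $segment"
        "$NUTCH" fetch "$segment" || return 1
        "$NUTCH" updatedb "$crawl_dir/crawldb" "$segment" || return 1
        i=$((i + 1))
    done
}
```

With this in place, `crawl_rounds 2 1000` would run the two rounds shown above, stopping early if any command fails.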
  By this point we've fetched a few thousand pages. Let's index them!
  
- Whole-web: Indexing
+ === Whole-web: Indexing ===
+ 
  Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages.
  
- bin/nutch invertlinks crawl/linkdb crawl/segments
+ {{{ bin/nutch invertlinks crawl/linkdb crawl/segments }}}
+ 
  To index the segments we use the index command, as follows:
  
- bin/nutch index indexes crawl/linkdb crawl/segments/*
+ {{{ bin/nutch index indexes crawl/linkdb crawl/segments/* }}}
+ 
  Now we're ready to search!
  
- Searching
+ == Searching ==
+ 
  To search, you need to put the Nutch WAR file into your servlet container. (If you checked the sources out of SVN instead of downloading a Nutch release, you'll first need to build the WAR file with the command ant war.)
  
  Assuming you've unpacked Tomcat as ~/local/tomcat, the Nutch WAR file can be installed with the commands:
  
- rm -rf ~/local/tomcat/webapps/ROOT*
+ {{{ rm -rf ~/local/tomcat/webapps/ROOT*
- cp nutch*.war ~/local/tomcat/webapps/ROOT.war
+ cp nutch*.war ~/local/tomcat/webapps/ROOT.war }}}
+ 
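Because the install step begins with `rm -rf`, it may be worth checking the paths before deleting anything. Here is a minimal sketch of the two commands above with such checks added. `install_war` and its parameters are hypothetical names, not part of Nutch or Tomcat.

```shell
#!/bin/sh
# install_war is a hypothetical helper, not part of Nutch or Tomcat.
install_war() {
    war=$1           # path to the Nutch WAR file
    tomcat_home=$2   # Tomcat directory, e.g. ~/local/tomcat

    if [ ! -f "$war" ]; then
        echo "WAR file not found: $war" >&2
        return 1
    fi
    if [ ! -d "$tomcat_home/webapps" ]; then
        echo "no webapps directory under: $tomcat_home" >&2
        return 1
    fi
    # Remove the existing ROOT webapp, then install the WAR as ROOT.war.
    rm -rf "$tomcat_home"/webapps/ROOT*
    cp "$war" "$tomcat_home/webapps/ROOT.war"
}
```

Nothing is deleted unless both the WAR file and the `webapps` directory exist.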
  The webapp finds its indexes in ./crawl, relative to where you start Tomcat, so use a command like:
  
- ~/local/tomcat/bin/catalina.sh start
+ {{{ ~/local/tomcat/bin/catalina.sh start }}}
+ 
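Since the webapp looks for ./crawl relative to the directory Tomcat is started from, the working directory matters when running the start command. The sketch below makes that explicit by changing into the directory that contains crawl/ before starting Tomcat. `start_search` and the `CATALINA` variable are hypothetical names, and the default path assumes Tomcat was unpacked as ~/local/tomcat.

```shell
#!/bin/sh
# start_search is a hypothetical helper; CATALINA is an assumed variable name.
CATALINA=${CATALINA:-$HOME/local/tomcat/bin/catalina.sh}

start_search() {
    base_dir=$1   # directory that contains the crawl/ directory
    if [ ! -d "$base_dir/crawl" ]; then
        echo "no crawl directory under $base_dir" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is left unchanged.
    ( cd "$base_dir" && "$CATALINA" start )
}
```

For example, `start_search .` behaves like the command above when run from the directory holding crawl/.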
  Then visit http://localhost:8080/ and have fun!
