Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by JakeVanderdray:
http://wiki.apache.org/nutch/NutchTutorial

------------------------------------------------------------------------------
  Intranet crawling is more appropriate when you intend to crawl up to around one million pages on a handful of web servers.
- == Intranet: Configuration ==
+ === Intranet: Configuration ===
  To configure things for intranet crawling you must:
@@ -40, +40 @@
  This will include any url in the domain apache.org.
- == Intranet: Running the Crawl ==
+ === Intranet: Running the Crawl ===
  Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:
@@ -117, +117 @@
  Now we fetch a new segment with the top-scoring 1000 pages:
- bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+ {{{ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s2=`ls -d crawl/segments/2* | tail -1`
  echo $s2
  bin/nutch fetch $s2
- bin/nutch updatedb crawl/crawldb $s2
+ bin/nutch updatedb crawl/crawldb $s2 }}}
+
  Let's fetch one more round:
- bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+ {{{ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s3=`ls -d crawl/segments/2* | tail -1`
  echo $s3
  bin/nutch fetch $s3
- bin/nutch updatedb crawl/crawldb $s3
+ bin/nutch updatedb crawl/crawldb $s3 }}}
+
  By this point we've fetched a few thousand pages. Let's index them!
- Whole-web: Indexing
+ === Whole-web: Indexing ===
+
  Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages.
- bin/nutch invertlinks crawl/linkdb crawl/segments
+ {{{ bin/nutch invertlinks crawl/linkdb crawl/segments }}}
+
  To index the segments we use the index command, as follows:
- bin/nutch index indexes crawl/linkdb crawl/segments/*
+ {{{ bin/nutch index indexes crawl/linkdb crawl/segments/* }}}
+
  Now we're ready to search!
- Searching
+ == Searching ==
+
  To search you need to put the nutch war file into your servlet container. (If instead of downloading a Nutch release you checked the sources out of SVN, then you'll first need to build the war file, with the command ant war.) Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war file may be installed with the commands:
- rm -rf ~/local/tomcat/webapps/ROOT*
+ {{{ rm -rf ~/local/tomcat/webapps/ROOT*
- cp nutch*.war ~/local/tomcat/webapps/ROOT.war
+ cp nutch*.war ~/local/tomcat/webapps/ROOT.war }}}
+
  The webapp finds its indexes in ./crawl, relative to where you start Tomcat, so use a command like:
- ~/local/tomcat/bin/catalina.sh start
+ {{{ ~/local/tomcat/bin/catalina.sh start }}}
+
  Then visit http://localhost:8080/ and have fun!
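
The changed page repeats the same generate, fetch, updatedb sequence for each round (the s2 and s3 steps). As a rough sketch only, that sequence could be wrapped in a shell function. The names crawl_round, latest_segment, NUTCH and CRAWL_DIR are hypothetical helpers introduced here for illustration and are not part of Nutch or of the tutorial; only the bin/nutch subcommands and the -topN option come from the page itself.

```shell
#!/bin/sh
# Hypothetical wrapper around the commands quoted in the tutorial diff.
# NUTCH and CRAWL_DIR are assumed defaults, not Nutch settings.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

# Print the last segment directory in sorted order, using the same
# ls | tail idiom the tutorial uses for s2 and s3.
latest_segment() {
    ls -d "$1"/segments/2* 2>/dev/null | tail -1
}

# Run one round: generate a segment of the top-scoring N pages,
# fetch it, then update the crawl db with the results.
# $1 is the -topN value.
crawl_round() {
    "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN "$1" || return 1
    seg=$(latest_segment "$CRAWL_DIR")
    # Stop if generate did not leave a segment directory behind.
    [ -n "$seg" ] || return 1
    echo "$seg"
    "$NUTCH" fetch "$seg" || return 1
    "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$seg"
}
```

With these definitions sourced from the Nutch directory, something like `for i in 1 2; do crawl_round 1000 || break; done` would correspond to the two rounds shown in the diff, stopping early if any step fails.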
