Author: siren
Date: Thu Jun 8 13:03:11 2006
New Revision: 412847
URL: http://svn.apache.org/viewvc?rev=412847&view=rev
Log:
updated content for the 0.7.2 release, added a page about nightly builds, added Hadoop as a
related project
Added:
lucene/nutch/trunk/src/site/src/documentation/content/xdocs/nightly.xml
lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial8.xml
Modified:
lucene/nutch/trunk/src/site/src/documentation/content/xdocs/index.xml
lucene/nutch/trunk/src/site/src/documentation/content/xdocs/issue_tracking.xml
lucene/nutch/trunk/src/site/src/documentation/content/xdocs/site.xml
lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial.xml
Modified: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/index.xml
URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/index.xml?rev=412847&r1=412846&r2=412847&view=diff
==============================================================================
--- lucene/nutch/trunk/src/site/src/documentation/content/xdocs/index.xml (original)
+++ lucene/nutch/trunk/src/site/src/documentation/content/xdocs/index.xml Thu Jun 8 13:03:11 2006
@@ -15,6 +15,22 @@
<title>News</title>
<section>
+ <title>31 March 2006: Nutch 0.7.2 Released</title>
+ <p>The 0.7.2 release of Nutch is now available. This is a bug fix release for the 0.7 branch. See
+ <a href="http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=390158">
+ CHANGES.txt</a> for details. The release is available
+ <a href="http://lucene.apache.org/nutch/release/">here</a>.</p>
+ </section>
+
+ <section>
+ <title>1 October 2005: Nutch 0.7.1 Released</title>
+ <p>The 0.7.1 release of Nutch is now available. This is a bug fix release. See
+ <a href="http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986">
+ CHANGES.txt</a> for details. The release is available
+ <a href="http://lucene.apache.org/nutch/release/">here</a>.</p>
+ </section>
+
+ <section>
<title>17 August 2005: Nutch 0.7 Released</title>
<p>This is the first Nutch release as an Apache Lucene sub-project. See
<a href="http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/CHANGES.txt?rev=233150">
Modified: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/issue_tracking.xml
URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/issue_tracking.xml?rev=412847&r1=412846&r2=412847&view=diff
==============================================================================
--- lucene/nutch/trunk/src/site/src/documentation/content/xdocs/issue_tracking.xml (original)
+++ lucene/nutch/trunk/src/site/src/documentation/content/xdocs/issue_tracking.xml Thu Jun 8 13:03:11 2006
@@ -11,7 +11,7 @@
<body>
<p>
Nutch issues (bugs, as well as enhancement requests) are tracked in
- Apache JIRA <a href="http://nagoya.apache.org/jira/browse/Nutch">here</a>.
+ Apache JIRA <a href="http://issues.apache.org/jira/browse/Nutch">here</a>.
If you aren't sure whether something is a bug, post a question on the
Nutch user <a href="mailing_lists.html">mailing list</a>.
</p>
Added: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/nightly.xml
URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/nightly.xml?rev=412847&view=auto
==============================================================================
--- lucene/nutch/trunk/src/site/src/documentation/content/xdocs/nightly.xml (added)
+++ lucene/nutch/trunk/src/site/src/documentation/content/xdocs/nightly.xml Thu Jun 8 13:03:11 2006
@@ -0,0 +1,28 @@
+<?xml version="1.0"?>
+
+<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
+
+<document>
+
+ <header>
+ <title>Nightly builds</title>
+ </header>
+
+ <body>
+ <p>
+ Nightly binary builds contain the latest available code. They are
+ provided for testing only and may or may not be functional.
+ </p>
+ <p>
+ You can track the progress of the 0.8-dev version in the <a href="http://issues.apache.org/jira/browse/NUTCH?report=com.atlassian.jira.plugin.system.project:roadmap-panel">JIRA roadmap report</a>.
+ </p>
+ <p>
+ To report bugs, see <a href="issue_tracking.html">issue tracking</a>.
+ </p>
+ <p>
+ <a href="http://people.apache.org/dist/lucene/nutch/nightly/">Nutch
nightly builds</a> (0.8-dev)
+ </p>
+ </body>
+
+</document>
\ No newline at end of file
Modified: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/site.xml
URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/site.xml?rev=412847&r1=412846&r2=412847&view=diff
==============================================================================
--- lucene/nutch/trunk/src/site/src/documentation/content/xdocs/site.xml (original)
+++ lucene/nutch/trunk/src/site/src/documentation/content/xdocs/site.xml Thu Jun 8 13:03:11 2006
@@ -26,7 +26,8 @@
<docs label="Documentation">
<faq label="FAQ" href="ext:faq" />
<wiki label="Wiki" href="ext:wiki" />
- <tutorial label="Tutorial" href="tutorial.html" />
+ <tutorial label="Tutorial ver. 0.7" href="tutorial.html" />
+ <tutorial8 label="Tutorial ver. 0.8" href="tutorial8.html" />
<webmasters label="Robot " href="bot.html" />
<i18n label="i18n" href="i18n.html" />
<apidocs label="API Docs" href="apidocs/index.html" />
@@ -34,17 +35,21 @@
<resources label="Resources">
<download label="Download" href="release/" />
+ <nightly label="Nightly builds" href="nightly.html" />
<contact label="Mailing Lists" href="mailing_lists.html" />
<issues label="Issue Tracking" href="issue_tracking.html" />
<vcs label="Version Control" href="version_control.html" />
</resources>
+
<projects label="Related Projects">
<lucene label="Lucene Java" href="ext:lucene" />
+ <hadoop label="Hadoop" href="ext:hadoop" />
</projects>
<external-refs>
<lucene href="http://lucene.apache.org/java/" />
+ <hadoop href="http://lucene.apache.org/hadoop/" />
<wiki href="http://wiki.apache.org/nutch/" />
<faq href="http://wiki.apache.org/nutch/FAQ" />
<store href="http://www.cafepress.com/nutch/" />
Modified: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial.xml
URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial.xml?rev=412847&r1=412846&r2=412847&view=diff
==============================================================================
--- lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial.xml (original)
+++ lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial.xml Thu Jun 8 13:03:11 2006
@@ -6,7 +6,7 @@
<document>
<header>
- <title>Nutch tutorial</title>
+ <title>Nutch version 0.7 tutorial</title>
</header>
<body>
@@ -66,11 +66,11 @@
<ol>
-<li>Create a directory with a flat file of root urls. For example, to
-crawl the <code>nutch</code> site you might start with a file named
-<code>urls/nutch</code> containing the url of just the Nutch home
-page. All other Nutch pages should be reachable from this page. The
-<code>urls/nutch</code> file would thus contain:
+<li>Create a flat file of root urls. For example, to crawl the
+<code>nutch</code> site you might start with a file named
+<code>urls</code> containing just the Nutch home page. All other
+Nutch pages should be reachable from this page. The <code>urls</code>
+file would thus look like:
<source>
http://lucene.apache.org/nutch/
</source>
@@ -97,28 +97,24 @@
<ul>
<li><code>-dir</code> <em>dir</em> names the directory to put the crawl in.</li>
-<li><code>-threads</code> <em>threads</em> determines the number of
-threads that will fetch in parallel.</li>
<li><code>-depth</code> <em>depth</em> indicates the link depth from the root
page that should be crawled.</li>
-<li><code>-topN</code> <em>N</em> determines the maximum number of pages that
-will be retrieved at each level up to the depth.</li>
+<li><code>-delay</code> <em>delay</em> determines the number of seconds
+between accesses to each host.</li>
+<li><code>-threads</code> <em>threads</em> determines the number of
+threads that will fetch in parallel.</li>
</ul>
<p>For example, a typical call might be:</p>
<source>
-bin/nutch crawl urls -dir crawl -depth 3 -topN 50
+bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log
</source>
-<p>Typically one starts testing one's configuration by crawling at
-shallow depths, sharply limiting the number of pages fetched at each
-level (<code>-topN</code>), and watching the output to check that
-desired pages are fetched and undesirable pages are not. Once one is
-confident of the configuration, then an appropriate depth for a full
-crawl is around 10. The number of pages per level
-(<code>-topN</code>) for a full crawl can be from tens of thousands to
-millions, depending on your resources.</p>
+<p>Typically one starts testing one's configuration by crawling at low
+depths, and watching the output to check that desired pages are found.
+Once one is more confident of the configuration, then an appropriate
+depth for a full crawl is around 10.</p>
<p>Once crawling has completed, one can skip to the Searching section
below.</p>
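
A minimal sketch of how the options above combine into a fuller 0.7 crawl, assuming
the urls file created earlier; the depth, delay, and thread values are illustrative only:

    # shallow test crawl first, then a deeper run once the configuration looks right
    bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log
    bin/nutch crawl urls -dir crawl.full -depth 10 -delay 5 -threads 4 >& crawl-full.log
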
@@ -135,62 +131,54 @@
<section>
<title>Whole-web: Concepts</title>
-<p>Nutch data is composed of:</p>
+<p>Nutch data is of two types:</p>
<ol>
-
- <li>The crawl database, or <em>crawldb</em>. This contains
-information about every url known to Nutch, including whether it was
-fetched, and, if so, when.</li>
-
- <li>The link database, or <em>linkdb</em>. This contains the list
-of known links to each url, including both the source url and anchor
-text of the link.</li>
-
- <li>A set of <em>segments</em>. Each segment is a set of urls that are
-fetched as a unit. Segments are directories with the following
-subdirectories:</li>
-
+ <li>The web database. This contains information about every
+page known to Nutch, and about links between those pages.</li>
+ <li>A set of segments. Each segment is a set of pages that are
+fetched and indexed as a unit. Segment data consists of the
+following types:</li>
<li><ul>
- <li>a <em>crawl_generate</em> names a set of urls to be fetched</li>
- <li>a <em>crawl_fetch</em> contains the status of fetching each url</li>
- <li>a <em>content</em> contains the content of each url</li>
- <li>a <em>parse_text</em> contains the parsed text of each url</li>
- <li>a <em>parse_data</em> contains outlinks and metadata parsed
- from each url</li>
- <li>a <em>crawl_parse</em> contains the outlink urls, used to
- update the crawldb</li>
+ <li>a <em>fetchlist</em> is a file
+that names a set of pages to be fetched</li>
+ <li>the <em>fetcher output</em> is a
+set of files containing the fetched pages</li>
+ <li>the <em>index</em> is a
+Lucene-format index of the fetcher output.</li>
</ul></li>
-
-<li>The <em>indexes</em>are Lucene-format indexes.</li>
-
</ol>
+<p>In the following examples we will keep our web database in a directory
+named <code>db</code> and our segments
+in a directory named <code>segments</code>:</p>
+<source>mkdir db
+mkdir segments</source>
</section>
<section>
<title>Whole-web: Bootstrapping the Web Database</title>
+<p>The admin tool is used to create a new, empty database:</p>
+
+<source>bin/nutch admin db -create</source>
-<p>The <em>injector</em> adds urls to the crawldb. Let's inject URLs
-from the <a href="http://dmoz.org/">DMOZ</a> Open Directory. First we
-must download and uncompress the file listing all of the DMOZ pages.
-(This is a 200+Mb file, so this will take a few minutes.)</p>
+<p>The <em>injector</em> adds urls into the database. Let's inject
+URLs from the <a href="http://dmoz.org/">DMOZ</a> Open
+Directory. First we must download and uncompress the file listing all
+of the DMOZ pages. (This is a 200+Mb file, so this will take a few
+minutes.)</p>
<source>wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz</source>
-<p>Next we select a random subset of these pages.
+<p>Next we inject a random subset of these pages into the web database.
(We use a random subset so that everyone who runs this tutorial
doesn't hammer the same sites.) DMOZ contains around three million
-URLs. We select one out of every 5000, so that we end up with
+URLs. We inject one out of every 3000, so that we end up with
around 1000 URLs:</p>
-<source>mkdir dmoz
-bin/nutch org.apache.nutch.crawl.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls</source>
+<source>bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000</source>
-<p>The parser also takes a few minutes, as it must parse the full
-file. Finally, we initialize the crawl db with the selected urls.</p>
-
-<source>bin/nutch inject crawl/crawldb dmoz</source>
+<p>This also takes a few minutes, as it must parse the full file.</p>
<p>Now we have a web database with around 1000 as-yet unfetched URLs in it.</p>
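
The bootstrap steps in this hunk can be chained into one short script. A sketch,
assuming it is run from the Nutch install directory, using only the commands shown above:

    #!/bin/sh
    # 0.7-style whole-web bootstrap, mirroring the commands in this section
    mkdir db
    mkdir segments
    bin/nutch admin db -create                                  # new, empty web database
    wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz              # DMOZ listing (200+Mb)
    gunzip content.rdf.u8.gz
    bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000   # ~1000 random urls
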
@@ -198,39 +186,39 @@
<section>
<title>Whole-web: Fetching</title>
<p>To fetch, we first generate a fetchlist from the database:</p>
-<source>bin/nutch generate crawl/crawldb crawl/segments
+<source>bin/nutch generate db segments
</source>
<p>This generates a fetchlist for all of the pages due to be fetched.
The fetchlist is placed in a newly created segment directory.
The segment directory is named by the time it's created. We
save the name of this segment in the shell variable <code>s1</code>:</p>
-<source>s1=`ls -d crawl/segments/2* | tail -1`
+<source>s1=`ls -d segments/2* | tail -1`
echo $s1
</source>
<p>Now we run the fetcher on this segment with:</p>
<source>bin/nutch fetch $s1</source>
<p>When this is complete, we update the database with the results of the
fetch:</p>
-<source>bin/nutch updatedb crawl/crawldb $s1</source>
+<source>bin/nutch updatedb db $s1</source>
<p>Now the database has entries for all of the pages referenced by the
initial set.</p>
<p>Now we fetch a new segment with the top-scoring 1000 pages:</p>
-<source>bin/nutch generate crawl/crawldb crawl/segments -topN 1000
-s2=`ls -d crawl/segments/2* | tail -1`
+<source>bin/nutch generate db segments -topN 1000
+s2=`ls -d segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
-bin/nutch updatedb crawl/crawldb $s2
+bin/nutch updatedb db $s2
</source>
<p>Let's fetch one more round:</p>
<source>
-bin/nutch generate crawl/crawldb crawl/segments -topN 1000
-s3=`ls -d crawl/segments/2* | tail -1`
+bin/nutch generate db segments -topN 1000
+s3=`ls -d segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
-bin/nutch updatedb crawl/crawldb $s3
+bin/nutch updatedb db $s3
</source>
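
The three generate/fetch/updatedb rounds above follow the same pattern, so they can be
written as a loop. A sketch, assuming the 0.7 command names used in this hunk (note that
the first round in the text omits -topN):

    # repeat the fetch cycle; the round count and -topN value are illustrative
    for i in 1 2 3; do
      bin/nutch generate db segments -topN 1000
      s=`ls -d segments/2* | tail -1`       # newest segment directory
      bin/nutch fetch $s
      bin/nutch updatedb db $s
    done
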
<p>By this point we've fetched a few thousand pages. Let's index
@@ -239,20 +227,16 @@
</section>
<section>
<title>Whole-web: Indexing</title>
+<p>To index each segment we use the <code>index</code>
+command, as follows:</p>
+<source>bin/nutch index $s1
+bin/nutch index $s2
+bin/nutch index $s3</source>
-<p>Before indexing we first invert all of the links, so that we may
-index incoming anchor text with the pages.</p>
-
-<source>bin/nutch invertlinks crawl/linkdb crawl/segments</source>
-
-<p>To index the segments we use the <code>index</code> command, as follows:</p>
-
-<source>bin/nutch index indexes crawl/linkdb crawl/segments/*</source>
-
-<!-- <p>Then, before we can search a set of segments, we need to delete -->
-<!-- duplicate pages. This is done with:</p> -->
+<p>Then, before we can search a set of segments, we need to delete
+duplicate pages. This is done with:</p>
-<!-- <source>bin/nutch dedup indexes</source> -->
+<source>bin/nutch dedup segments dedup.tmp</source>
<p>Now we're ready to search!</p>
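
Since each segment is indexed separately in 0.7, the indexing and dedup steps above can
also be expressed as a loop over all segments. A sketch, assuming the segments directory
used in this section:

    # index every segment, then remove duplicate pages
    for s in segments/2*; do
      bin/nutch index $s
    done
    bin/nutch dedup segments dedup.tmp
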
@@ -272,8 +256,10 @@
cp nutch*.war ~/local/tomcat/webapps/ROOT.war
</source>
-<p>The webapp finds its indexes in <code>./crawl</code>, relative
-to where you start Tomcat, so use a command like:</p>
+<p>The webapp finds its indexes in <code>./segments</code>, relative
+to where you start Tomcat, so, if you've done intranet crawling,
+connect to your crawl directory, or, if you've done whole-web
+crawling, don't change directories, and give the command:</p>
<source>~/local/tomcat/bin/catalina.sh start
</source>
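
For the intranet case described above, a sketch of the deploy-and-start sequence,
assuming Tomcat is unpacked as ~/local/tomcat and the crawl was made with -dir crawl.test:

    rm -rf ~/local/tomcat/webapps/ROOT*
    cp nutch*.war ~/local/tomcat/webapps/ROOT.war
    cd crawl.test                            # the webapp looks for ./segments from here
    ~/local/tomcat/bin/catalina.sh start
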
@@ -281,6 +267,8 @@
<p>Then visit <a href="http://localhost:8080/">http://localhost:8080/</a>
and have fun!</p>
+<p>More detailed tutorials are available on the Nutch Wiki.
+</p>
</section>
</section>
Added: lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial8.xml
URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial8.xml?rev=412847&view=auto
==============================================================================
--- lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial8.xml (added)
+++ lucene/nutch/trunk/src/site/src/documentation/content/xdocs/tutorial8.xml Thu Jun 8 13:03:11 2006
@@ -0,0 +1,291 @@
+<?xml version="1.0"?>
+
+<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN"
+ "http://forrest.apache.org/dtd/document-v20.dtd">
+
+<document>
+
+<header>
+ <title>Nutch version 0.8 tutorial</title>
+</header>
+
+<body>
+
+<section>
+<title>Requirements</title>
+<ol>
+ <li>Java 1.4.x, either from <a
+ href="http://java.sun.com/j2se/downloads.html">Sun</a> or <a
+ href="http://www-106.ibm.com/developerworks/java/jdk/">IBM</a> on
+ Linux is preferred. Set <code>NUTCH_JAVA_HOME</code> to the root
+ of your JVM installation.
+ </li>
+ <li>Apache's <a href="http://jakarta.apache.org/tomcat/">Tomcat</a>
+4.x.</li>
+ <li>On Win32, <a href="http://www.cygwin.com/">cygwin</a>, for
+shell support. (If you plan to use Subversion on Win32, be sure to select the subversion package when you install, in the "Devel" category.)</li>
+ <li>Up to a gigabyte of free disk space, a high-speed connection, and
+an hour or so.
+ </li>
+</ol>
+</section>
+<section>
+<title>Getting Started</title>
+
+<p>First, you need to get a copy of the Nutch code. You can download
+a release from <a
+href="http://lucene.apache.org/nutch/release/">http://lucene.apache.org/nutch/release/</a>.
+Unpack the release and connect to its top-level directory. Or, check
+out the latest source code from <a
+href="version_control.html">subversion</a> and build it
+with <a href="http://ant.apache.org/">Ant</a>.</p>
+
+<p>Try the following command:</p>
+<source>bin/nutch</source>
+<p>This will display the documentation for the Nutch command script.</p>
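
A sketch of the checkout-and-build route mentioned above; the repository URL is an
assumption based on the paths in this commit (see version_control.html for the
authoritative location), and the JVM path is only an example:

    svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
    cd nutch
    ant                                       # build Nutch with Ant
    export NUTCH_JAVA_HOME=/usr/lib/j2sdk1.4  # example path; point at your JVM root
    bin/nutch                                 # prints the Nutch command script usage
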
+
+<p>Now we're ready to crawl. There are two approaches to crawling:</p>
+<ol>
+<li>Intranet crawling, with the <code>crawl</code> command.</li>
+<li>Whole-web crawling, with much greater control, using the lower
+level <code>inject</code>, <code>generate</code>, <code>fetch</code>
+and <code>updatedb</code> commands.</li>
+</ol>
+
+</section>
+<section>
+<title>Intranet Crawling</title>
+
+<p>Intranet crawling is more appropriate when you intend to crawl up to
+around one million pages on a handful of web servers.</p>
+
+<section>
+<title>Intranet: Configuration</title>
+
+<p>To configure things for intranet crawling you must:</p>
+
+<ol>
+
+<li>Create a directory with a flat file of root urls. For example, to
+crawl the <code>nutch</code> site you might start with a file named
+<code>urls/nutch</code> containing the url of just the Nutch home
+page. All other Nutch pages should be reachable from this page. The
+<code>urls/nutch</code> file would thus contain:
+<source>
+http://lucene.apache.org/nutch/
+</source>
+</li>
+
+<li>Edit the file <code>conf/crawl-urlfilter.txt</code> and replace
+<code>MY.DOMAIN.NAME</code> with the name of the domain you wish to
+crawl. For example, if you wished to limit the crawl to the
+<code>apache.org</code> domain, the line should read:
+<source>
++^http://([a-z0-9]*\.)*apache.org/
+</source>
+This will include any url in the domain <code>apache.org</code>.
+</li>
+
+</ol>
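
Step 1 of the list above can be done with, for example:

    # a urls directory containing one flat file of root urls
    mkdir urls
    echo 'http://lucene.apache.org/nutch/' > urls/nutch
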
+
+</section>
+<section>
+<title>Intranet: Running the Crawl</title>
+
+<p>Once things are configured, running the crawl is easy. Just use the
+crawl command. Its options include:</p>
+
+<ul>
+<li><code>-dir</code> <em>dir</em> names the directory to put the crawl in.</li>
+<li><code>-threads</code> <em>threads</em> determines the number of
+threads that will fetch in parallel.</li>
+<li><code>-depth</code> <em>depth</em> indicates the link depth from the root
+page that should be crawled.</li>
+<li><code>-topN</code> <em>N</em> determines the maximum number of pages that
+will be retrieved at each level up to the depth.</li>
+</ul>
+
+<p>For example, a typical call might be:</p>
+
+<source>
+bin/nutch crawl urls -dir crawl -depth 3 -topN 50
+</source>
+
+<p>Typically one starts testing one's configuration by crawling at
+shallow depths, sharply limiting the number of pages fetched at each
+level (<code>-topN</code>), and watching the output to check that
+desired pages are fetched and undesirable pages are not. Once one is
+confident of the configuration, then an appropriate depth for a full
+crawl is around 10. The number of pages per level
+(<code>-topN</code>) for a full crawl can be from tens of thousands to
+millions, depending on your resources.</p>
+
+<p>Once crawling has completed, one can skip to the Searching section
+below.</p>
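
A sketch of the test-then-full-crawl workflow described above, using the 0.8 options
from this section; the -topN and -threads values are illustrative only:

    bin/nutch crawl urls -dir crawl.test -depth 3 -topN 50      # shallow test crawl
    bin/nutch crawl urls -dir crawl -depth 10 -topN 100000 -threads 10
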
+
+</section>
+</section>
+
+<section>
+<title>Whole-web Crawling</title>
+
+<p>Whole-web crawling is designed to handle very large crawls which may
+take weeks to complete, running on multiple machines.</p>
+
+<section>
+<title>Whole-web: Concepts</title>
+
+<p>Nutch data is composed of:</p>
+
+<ol>
+
+ <li>The crawl database, or <em>crawldb</em>. This contains
+information about every url known to Nutch, including whether it was
+fetched, and, if so, when.</li>
+
+ <li>The link database, or <em>linkdb</em>. This contains the list
+of known links to each url, including both the source url and anchor
+text of the link.</li>
+
+ <li>A set of <em>segments</em>. Each segment is a set of urls that are
+fetched as a unit. Segments are directories with the following
+subdirectories:</li>
+
+ <li><ul>
+ <li>a <em>crawl_generate</em> names a set of urls to be fetched</li>
+ <li>a <em>crawl_fetch</em> contains the status of fetching each url</li>
+ <li>a <em>content</em> contains the content of each url</li>
+ <li>a <em>parse_text</em> contains the parsed text of each url</li>
+ <li>a <em>parse_data</em> contains outlinks and metadata parsed
+ from each url</li>
+ <li>a <em>crawl_parse</em> contains the outlink urls, used to
+ update the crawldb</li>
+ </ul></li>
+
+<li>The <em>indexes</em> are Lucene-format indexes.</li>
+
+</ol>
+
+</section>
+<section>
+<title>Whole-web: Bootstrapping the Web Database</title>
+
+<p>The <em>injector</em> adds urls to the crawldb. Let's inject URLs
+from the <a href="http://dmoz.org/">DMOZ</a> Open Directory. First we
+must download and uncompress the file listing all of the DMOZ pages.
+(This is a 200+Mb file, so this will take a few minutes.)</p>
+
+<source>wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
+gunzip content.rdf.u8.gz</source>
+
+<p>Next we select a random subset of these pages.
+ (We use a random subset so that everyone who runs this tutorial
+doesn't hammer the same sites.) DMOZ contains around three million
+URLs. We select one out of every 5000, so that we end up with
+around 1000 URLs:</p>
+
+<source>mkdir dmoz
+bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls</source>
+
+<p>The parser also takes a few minutes, as it must parse the full
+file. Finally, we initialize the crawl db with the selected urls.</p>
+
+<source>bin/nutch inject crawl/crawldb dmoz</source>
+
+<p>Now we have a web database with around 1000 as-yet unfetched URLs in it.</p>
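
As with the 0.7 tutorial, the bootstrap steps can be chained into one script. A sketch,
assuming it is run from the Nutch install directory, using only the commands shown above:

    #!/bin/sh
    # 0.8-style whole-web bootstrap, mirroring the commands in this section
    wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz              # DMOZ listing (200+Mb)
    gunzip content.rdf.u8.gz
    mkdir dmoz
    bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
    bin/nutch inject crawl/crawldb dmoz     # initialize the crawldb with the selected urls
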
+
+</section>
+<section>
+<title>Whole-web: Fetching</title>
+<p>To fetch, we first generate a fetchlist from the database:</p>
+<source>bin/nutch generate crawl/crawldb crawl/segments
+</source>
+<p>This generates a fetchlist for all of the pages due to be fetched.
+ The fetchlist is placed in a newly created segment directory.
+ The segment directory is named by the time it's created. We
+save the name of this segment in the shell variable <code>s1</code>:</p>
+<source>s1=`ls -d crawl/segments/2* | tail -1`
+echo $s1
+</source>
+<p>Now we run the fetcher on this segment with:</p>
+<source>bin/nutch fetch $s1</source>
+<p>When this is complete, we update the database with the results of the
+fetch:</p>
+<source>bin/nutch updatedb crawl/crawldb $s1</source>
+<p>Now the database has entries for all of the pages referenced by the
+initial set.</p>
+
+<p>Now we fetch a new segment with the top-scoring 1000 pages:</p>
+<source>bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+s2=`ls -d crawl/segments/2* | tail -1`
+echo $s2
+
+bin/nutch fetch $s2
+bin/nutch updatedb crawl/crawldb $s2
+</source>
+<p>Let's fetch one more round:</p>
+<source>
+bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+s3=`ls -d crawl/segments/2* | tail -1`
+echo $s3
+
+bin/nutch fetch $s3
+bin/nutch updatedb crawl/crawldb $s3
+</source>
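
The three rounds above can again be written as a loop. A sketch, assuming the 0.8 crawldb
and segments paths used in this hunk (the first round in the text omits -topN):

    # repeat the fetch cycle; the round count and -topN value are illustrative
    for i in 1 2 3; do
      bin/nutch generate crawl/crawldb crawl/segments -topN 1000
      s=`ls -d crawl/segments/2* | tail -1` # newest segment directory
      bin/nutch fetch $s
      bin/nutch updatedb crawl/crawldb $s
    done
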
+
+<p>By this point we've fetched a few thousand pages. Let's index
+them!</p>
+
+</section>
+<section>
+<title>Whole-web: Indexing</title>
+
+<p>Before indexing we first invert all of the links, so that we may
+index incoming anchor text with the pages.</p>
+
+<source>bin/nutch invertlinks crawl/linkdb crawl/segments</source>
+
+<p>To index the segments we use the <code>index</code> command, as follows:</p>
+
+<source>bin/nutch index indexes crawl/linkdb crawl/segments/*</source>
+
+<!-- <p>Then, before we can search a set of segments, we need to delete -->
+<!-- duplicate pages. This is done with:</p> -->
+
+<!-- <source>bin/nutch dedup indexes</source> -->
+
+<p>Now we're ready to search!</p>
+
+</section>
+<section>
+<title>Searching</title>
+
+<p>To search you need to put the nutch war file into your servlet
+container. (If instead of downloading a Nutch release you checked the
+sources out of SVN, then you'll first need to build the war file, with
+the command <code>ant war</code>.)</p>
+
+<p>Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war
+file may be installed with the commands:</p>
+
+<source>rm -rf ~/local/tomcat/webapps/ROOT*
+cp nutch*.war ~/local/tomcat/webapps/ROOT.war
+</source>
+
+<p>The webapp finds its indexes in <code>./crawl</code>, relative
+to where you start Tomcat, so use a command like:</p>
+
+<source>~/local/tomcat/bin/catalina.sh start
+</source>
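
For an intranet crawl made with -dir crawl, a sketch of the deploy-and-start sequence,
assuming Tomcat is unpacked as ~/local/tomcat:

    rm -rf ~/local/tomcat/webapps/ROOT*
    cp nutch*.war ~/local/tomcat/webapps/ROOT.war
    ~/local/tomcat/bin/catalina.sh start    # run from the directory that contains ./crawl
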
+
+<p>Then visit <a href="http://localhost:8080/">http://localhost:8080/</a>
+and have fun!</p>
+
+<p>More detailed tutorials are available on the Nutch Wiki.
+</p>
+
+</section>
+</section>
+
+</body>
+</document>