Kragen, that looks really great! I've written something similar for my own use, which is in pseudo-twiki format right now. (see attached)
Shall we join forces in trying to document this beast? :) --matt

On Fri, 12 Nov 2004 13:50:50 -0800, Kragen Sitaker <[EMAIL PROTECTED]> wrote:
> I just added a couple of new pages to the Wiki, one showing the
> dependency structure of the Nutch code (at
> <http://www.nutch.org/cgi-bin/twiki/view/Main/NutchLayerDiagram>) and
> one describing the inputs and outputs of each of the Nutch tools, plus
> the NutchBean (at
> <http://www.nutch.org/cgi-bin/twiki/view/Main/InputsAndOutputs>.) I
> have found this information very useful in understanding Nutch and
> figuring out how to do things with it. I hope others find it useful as
> well.
>
> One current nit is that both pages are up-to-date only as of Nutch 0.5
> and have not been updated to the current CVS code. This may be ideal
> for the InputsAndOutputs page, which will be useful even to people just
> running a Nutch search engine, but probably not for the layer diagram,
> which is mostly useful for people modifying the code.
A Dissection of the Nutch 0.5 Crawler
(10/2004 kangas)

Introduction
------------

The open-source Nutch search engine consists, very roughly, of three
components:

- the crawler, which discovers and retrieves web pages
- the WebDB, a custom database that stores known URLs and fetched page
  contents
- the indexer, which dissects pages and builds keyword-based indexes
  from them

This document attempts to describe the operation of the crawler. We begin
with theory and drill down into the details needed to create a customized
crawler. Nutch is implemented in Java, so basic knowledge of the language
is assumed.

The "nutch" shell script
------------------------

http://www.nutch.org/docs/en/tutorial.html

The Nutch tutorial describes a number of operations that can be performed
using the "bin/nutch" shell script. Looking inside this script, we see
that each command corresponds to a specific Java class.

For an intranet crawl, you will edit some config files and then call
"bin/nutch crawl ...". This corresponds to the class
net.nutch.tools.CrawlTool.

For a whole-web crawl, you will perform several steps, including:

<verbatim>
$ bin/nutch admin db -create
$ bin/nutch inject db ...
$ bin/nutch generate db segments
$ bin/nutch fetch ...
$ bin/nutch updatedb ...
$ bin/nutch analyze ...
</verbatim>

Each command corresponds to a Java class as follows:

- admin:    net.nutch.tools.WebDBAdminTool
- inject:   net.nutch.db.WebDBInjector
- generate: net.nutch.tools.FetchListTool
- fetch:    net.nutch.fetcher.Fetcher
- updatedb: net.nutch.tools.UpdateDatabaseTool
- analyze:  net.nutch.tools.LinkAnalysisTool

Each command can be specified either by its nickname or by its full class
name. Thus, the following two commands have the same effect:

<verbatim>
$ bin/nutch admin db -create
$ bin/nutch net.nutch.tools.WebDBAdminTool db -create
</verbatim>

The ability to invoke arbitrary Java classes will come in handy when we
want to customize the behavior of the basic Nutch operations. Let's see
how we might do that by examining the one-step intranet crawler.

Command "crawl": net.nutch.tools.CrawlTool
------------------------------------------

CrawlTool is a class that does little more than lash together the steps
you'd do manually for a whole-web crawl. It consists of two simple static
methods, plus a main(). Here is an outline of its operations:

<verbatim>
- start logger: LogFormatter.getLogger(...)
- load "crawl-tool.xml" config file: NutchConf.addConfResource(...)
- read arguments from the command line
- create a new web db: WebDBAdminTool.main(...)
- add root URLs into the db: WebDBInjector.main(...)
- for 1 to depth (=5 by default):
  - generate a new segment: FetchListTool.main(...)
  - fetch the segment: Fetcher.main(...)
  - update the db: UpdateDatabaseTool.main(...)
- comment: "Re-fetch everything to get complete set of incoming anchor texts"
- delete all old segment data: FileUtil.fullyDelete(...)
- make a single segment with all pages: FetchListTool.main(...)
- re-fetch everything: Fetcher.main(...)
- index: IndexSegment.main(...)
- dedup: DeleteDuplicates.main(...)
- merge: IndexMerger.main(...)
</verbatim>

Translating this into the equivalent "nutch" script commands, we can see
how similar it is to the whole-web crawling process:

<verbatim>
- (start logger, etc)
- bin/nutch admin db -create
- bin/nutch inject db ...
- (for 1 to depth:)
  - bin/nutch generate ...
  - bin/nutch fetch ...
  - bin/nutch updatedb ...
- (call net.nutch.FileUtil.fullyDelete(...))
- bin/nutch generate ...
- bin/nutch fetch ...
- bin/nutch index ...
- bin/nutch dedup ...
- bin/nutch merge ...
</verbatim>

If we wished to customize CrawlTool, we could easily copy its contents to
another class, edit, compile, then run it via "bin/nutch" using its full
class name. But, as you can see, there isn't much here to customize! The
actual work of making HTTP requests occurs inside Fetcher.main(). Let's
examine the steps that occur before Fetcher.main(...), then dive into the
crawler itself.

Command "admin -create": net.nutch.tools.WebDBAdminTool
-------------------------------------------------------

> "admin: database administration, including creation"
>
> Usage: java net.nutch.tools.WebDBAdminTool db [-create]
>   [-textdump dumpPrefix] [-scoredump] [-top k]

The "-create" option is a wrapper around
"WebDBWriter.createWebDB(directory)". This in turn instantiates one
WebDBWriter object with the arguments (dir, true) and then immediately
calls ".close()" on the object.

Using "spam" as a directory name, let's run it and see what it creates:

<verbatim>
$ bin/nutch admin spam -create
$ find spam -type file | xargs ls -l
-rw-r--r--  1 kangas  users    0 Oct 25 18:31 spam/dbreadlock
-rw-r--r--  1 kangas  users    0 Oct 25 18:31 spam/dbwritelock
-rw-r--r--  1 kangas  users   16 Oct 25 18:31 spam/webdb/linksByMD5/data
-rw-r--r--  1 kangas  users   16 Oct 25 18:31 spam/webdb/linksByMD5/index
-rw-r--r--  1 kangas  users   16 Oct 25 18:31 spam/webdb/linksByURL/data
-rw-r--r--  1 kangas  users   16 Oct 25 18:31 spam/webdb/linksByURL/index
-rw-r--r--  1 kangas  users   16 Oct 25 18:31 spam/webdb/pagesByMD5/data
-rw-r--r--  1 kangas  users   16 Oct 25 18:31 spam/webdb/pagesByMD5/index
-rw-r--r--  1 kangas  users   16 Oct 25 18:31 spam/webdb/pagesByURL/data
-rw-r--r--  1 kangas  users   16 Oct 25 18:31 spam/webdb/pagesByURL/index
</verbatim>

Command "inject": net.nutch.db.WebDBInjector
--------------------------------------------

> "inject: inject new urls into the database"
>
> Usage: WebDBInjector <db_dir> (-urlfile <url_file> | -dmozfile <dmoz_file>)
>   [-subset <subsetDenominator>] [-includeAdultMaterial] [-skew skew]
>   [-noDmozDesc] [-topicFile <topic list file>]
>   [-topic <topic> [-topic <topic> [...]]]

WebDBInjector.main() accepts two input-type options. "-urlfile" parses a
simple list of URLs with one URL per line. "-dmozfile" is for parsing
DMOZ RDF files, which is useful for bootstrapping a whole-web database.

Let's see how it works. Create a file with one URL, then run
"bin/nutch inject":

<verbatim>
$ vi spam_url.txt
$ bin/nutch inject spam -urlfile spam_url.txt
$ find spam -type file | xargs ls -l
-rw-r--r--  1 kangas  users    0 Oct 25 18:57 spam/dbreadlock
-rw-r--r--  1 kangas  users    0 Oct 25 18:57 spam/dbwritelock
-rw-r--r--  1 kangas  users   16 Oct 25 18:57 spam/webdb/linksByMD5/data
-rw-r--r--  1 kangas  users   16 Oct 25 18:57 spam/webdb/linksByMD5/index
-rw-r--r--  1 kangas  users   16 Oct 25 18:57 spam/webdb/linksByURL/data
-rw-r--r--  1 kangas  users   16 Oct 25 18:57 spam/webdb/linksByURL/index
-rw-r--r--  1 kangas  users   89 Oct 25 18:57 spam/webdb/pagesByMD5/data
-rw-r--r--  1 kangas  users   97 Oct 25 18:57 spam/webdb/pagesByMD5/index
-rw-r--r--  1 kangas  users  115 Oct 25 18:57 spam/webdb/pagesByURL/data
-rw-r--r--  1 kangas  users   58 Oct 25 18:57 spam/webdb/pagesByURL/index
-rw-r--r--  1 kangas  users   17 Oct 25 18:57 spam/webdb/stats
</verbatim>

We can see that a new "stats" file was created, and the data/index files
in the "pagesBy..." directories were modified.
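Since CrawlTool drives these same tools by calling their main() methods,
the two steps above can also be driven from a small Java class of your
own and run through "bin/nutch" by its full class name (once it is
compiled onto the classpath). A minimal sketch, with a hypothetical class
name and the same arguments the shell commands take:

<verbatim>
package org.example;

// Hypothetical driver class: creates a webdb and injects seed URLs,
// the same way CrawlTool's first two steps do. Run it with, e.g.,
//   bin/nutch org.example.CreateAndInject
// so the Nutch classpath and config files are picked up for you.
public class CreateAndInject {
  public static void main(String[] args) throws Exception {
    // Equivalent of: bin/nutch admin spam -create
    net.nutch.tools.WebDBAdminTool.main(new String[] {"spam", "-create"});

    // Equivalent of: bin/nutch inject spam -urlfile spam_url.txt
    net.nutch.db.WebDBInjector.main(
        new String[] {"spam", "-urlfile", "spam_url.txt"});
  }
}
</verbatim>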
Command "generate": net.nutch.tools.FetchListTool ------------------------------------------------ > "generate: generate new segments to fetch" > > Usage: FetchListTool <db_dir> <segment_dir> [-refetchonly] [-anchoroptimize > linkdb] [-topN N] [-cutoff cutoffscore] [-numFetchers numFetchers] [-adddays > numDays] FetchListTool is used to create one or more "segments". From the tutorial: Each segment is a set of pages that are fetched and indexed as a unit. Segment data consists of the following types: - a "fetchlist": file that names the pages to be fetched - the "fetcher output": set of files containing the fetched pages - the "index" is a Lucene-format index of the fetcher output Within CrawlTool.main(), FetchListTool.main() is invoked once per "depth" value with two arguments: (dir + "/db", dir + "/segments"). After processing args, it creates an instance of itself, calls "flt.emitFetchList()", then returns. Let's run FetchListTool to see what it changes on disk. Note that we have to specify the webdb directory, plus another directory where segments are written to. <blockquote> <verbatim> $ bin/nutch generate spam spam_segments $ find spam -type file | xargs ls -l -rw-r--r-- 1 kangas users 0 Oct 25 20:18 spam/dbreadlock -rw-r--r-- 1 kangas users 0 Oct 25 20:18 spam/dbwritelock -rw-r--r-- 1 kangas users 16 Oct 25 20:18 spam/webdb/linksByMD5/data -rw-r--r-- 1 kangas users 16 Oct 25 20:18 spam/webdb/linksByMD5/index -rw-r--r-- 1 kangas users 16 Oct 25 20:18 spam/webdb/linksByURL/data -rw-r--r-- 1 kangas users 16 Oct 25 20:18 spam/webdb/linksByURL/index -rw-r--r-- 1 kangas users 89 Oct 25 20:18 spam/webdb/pagesByMD5/data -rw-r--r-- 1 kangas users 97 Oct 25 20:18 spam/webdb/pagesByMD5/index -rw-r--r-- 1 kangas users 115 Oct 25 20:18 spam/webdb/pagesByURL/data -rw-r--r-- 1 kangas users 58 Oct 25 20:18 spam/webdb/pagesByURL/index -rw-r--r-- 1 kangas users 17 Oct 25 20:18 spam/webdb/stats $ find spam_segments/ -type file | xargs ls -l -rw-r--r-- 1 kangas users 113 Oct 25 20:18 spam_segments/20041026001828/fetchlist/data -rw-r--r-- 1 kangas users 40 Oct 25 20:18 spam_segments/20041026001828/fetchlist/index </verbatim> </blockquote> Note that no changes occurred under the webdb dir ("spam"), but a new segments directory was created, and data+index files created therein. Command "fetch": net.nutch.fetcher.Fetcher ------------------------------------------ > "fetch: fetch a segment's pages" > > Usage: Fetcher [-logLevel level] [-showThreadID] [-threads n] dir So far we've created a webdb, primed it with URLs, and created a segment that a Fetcher can write to. Now let's look at the Fetcher itself, and try running it to see what comes out. net.nutch.fetcher.Fetcher relies on several other classes: - FetcherThread, an inner class - net.nutch.parse.ParserFactory - net.nutch.plugin.PluginRepository - and, of course, any "plugin" classes loaded by the PluginRepository Fetcher.main() reads arguments, instantiates a new Fetcher object, sets options, then calls run(). 
The Fetcher constructor is similarly simple; it just instantiates all of
the input/output streams:

<verbatim>
instance variable | class            | arguments
------------------+------------------+---------------------------------------
fetchList         | ArrayFile.Reader | (dir, "fetchlist")
fetchWriter       | ArrayFile.Writer | (dir, "fetcher", FetcherOutput.class)
contentWriter     | ArrayFile.Writer | (dir, "content", Content.class)
parseTextWriter   | ArrayFile.Writer | (dir, "parse_text", ParseText.class)
parseDataWriter   | ArrayFile.Writer | (dir, "parse_data", ParseData.class)
</verbatim>

Fetcher.run() instantiates 1..threadCount FetcherThread objects, calls
thread.start() on each, sleeps until all threads are gone or a fatal
error is logged, then calls close() on the i/o streams.

FetcherThread is an inner class of net.nutch.fetcher.Fetcher that extends
java.lang.Thread. It has one instance method, run(), and three static
methods: handleFetch(), handleNoFetch(), and logError().

FetcherThread.run() instantiates a new FetchListEntry called "fle", then
runs the following in an infinite loop:

1. If a fatal error was logged, break
2. Get the next entry in the FetchList; break if none remain
3. Extract the url from the FetchListEntry
4. If the FetchListEntry is not tagged "fetch", call this.handleNoFetch()
   with status=1. This in turn does:
   a. Get MD5Hash.digest() of the url
   b. Build a FetcherOutput(fle, hash, status)
   c. Build empty Content, ParseText, and ParseData objects
   d. Call Fetcher.outputPage() with all of these objects
5. If it is tagged "fetch", call ProtocolFactory and get Protocol and
   Content objects for this url
6. Call this.handleFetch(url, fle, content). This in turn does:
   a. Call ParserFactory.getParser() for this content type
   b. Call parser.getParse(content)
   c. Call Fetcher.outputPage() with a new FetcherOutput, including the
      url MD5, the populated Content object, and a new ParseText
7. On every 100th pass through the loop, write a status message to the log
8. Catch any exceptions and log as necessary

As we can see here, the fetcher relies on Factory classes to choose the
code it uses for different content types: ProtocolFactory finds a
Protocol instance for a given url, and ParserFactory finds a Parser for a
given contentType.

It should now be apparent that implementing a custom crawler with Nutch
will revolve around creating new Protocol/Parser classes, and updating
ProtocolFactory/ParserFactory to load them as needed. Let's examine these
classes now.

Factory classes: Overview
-------------------------

> Class net.nutch.parse.ParserFactory
>   used by:
>   - net.nutch.db.WebDBInjector
>   - net.nutch.fetcher.Fetcher
>   - net.nutch.parse.ParserChecker
>
> Class net.nutch.protocol.ProtocolFactory
>   used by:
>   - net.nutch.fetcher.Fetcher
>   - net.nutch.parse.ParserChecker
>
> Class net.nutch.net.URLFilterFactory
>   used by:
>   - net.nutch.db.WebDBInjector
>   - net.nutch.tools.UpdateDatabaseTool
>
> Class net.nutch.plugin.PluginRepository: used by (Parser/Protocol)Factory

Nutch's ParserFactory and ProtocolFactory classes are the key extension
points for the crawler. URLFilterFactory additionally provides an
extension point for other components, including WebDBInjector and
UpdateDatabaseTool.

These "Factory" classes can all be reconfigured by editing XML config
files. So before we describe the mechanics of any of the Factory classes,
we need to take a quick look at Nutch's configuration system.
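Before turning to the configuration system, here is a rough sketch of the
fetch-and-parse hand-off just described, done by hand outside the
Fetcher. The class and factory names follow the walkthrough above, but
the exact signatures (getProtocol(), getContent(), getParser(), and
getContentType() in particular) are assumptions to verify against your
Nutch source:

<verbatim>
package org.example;

import net.nutch.protocol.Content;
import net.nutch.protocol.Protocol;
import net.nutch.protocol.ProtocolFactory;
import net.nutch.parse.Parser;
import net.nutch.parse.ParserFactory;

// Hypothetical sketch of fetching and parsing a single URL, mirroring
// steps 5 and 6 of FetcherThread.run() above.
public class FetchOne {
  public static void main(String[] args) throws Exception {
    String url = "http://www.nutch.org/";

    // Step 5: ProtocolFactory picks a Protocol by URL scheme, and the
    // Protocol retrieves the raw Content. (getContent() is assumed.)
    Protocol protocol = ProtocolFactory.getProtocol(url);
    Content content = protocol.getContent(url);

    // Step 6: ParserFactory picks a Parser by content type; getParse()
    // produces the ParseText/ParseData that Fetcher.outputPage() writes.
    // (The getParser() argument list and getContentType() are assumed.)
    Parser parser = ParserFactory.getParser(content.getContentType(), url);
    Object parse = parser.getParse(content);

    System.out.println("parsed: " + parse);
  }
}
</verbatim>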
Aside: net.nutch.util.NutchConf
-------------------------------

If you have been reading the code along with our discussion, you may have
noticed several "private static final" variables at the start of the
"command" class definitions. For example, net.nutch.db.WebDBInjector has
these definitions for DEFAULT_INTERVAL and NEW_INJECTED_PAGE_SCORE:

<blockquote>
<verbatim>
private static final byte DEFAULT_INTERVAL =
  (byte)NutchConf.getInt("db.default.fetch.interval", 30);

private static final float NEW_INJECTED_PAGE_SCORE =
  NutchConf.getFloat("db.score.injected", 2.0f);
</verbatim>
</blockquote>

The values are loaded by calls to net.nutch.util.NutchConf, which is,
intuitively enough, a class that loads configuration files. It has two
static variables, "List resourceNames" and "Properties properties", plus
several static methods to manipulate them. Here's a summary of its
operations:

1. resourceNames is initialized with the strings "nutch-default.xml" and
   "nutch-site.xml"
2. "properties" is initially null
3. A call to one of the "getXXX" methods results in a call to getProps().
   If (properties == null), loadResource() is successively called with
   the values from "resourceNames".
4. loadResource() loads each file, parses the XML, and sets values in
   "properties" per the config

Factory classes: URLFilterFactory
---------------------------------

> Class net.nutch.net.URLFilterFactory
>   used by:
>   - net.nutch.db.WebDBInjector
>   - net.nutch.tools.UpdateDatabaseTool

URLFilterFactory is not strictly part of the crawler, but it is a good
extension point within Nutch. Here's how it works:

1. When the class is loaded, URLFILTER_CLASS is set to the value returned
   by NutchConf for the key "urlfilter.class"
2. When getFilter() is called, it checks to see if the filter class has
   already been loaded. If not, it is loaded using
   Class.forName(URLFILTER_CLASS), and the class is returned.

So URLFilterFactory loads one class, which is configurable via
"urlfilter.class". By default, nutch-default.xml specifies this as
follows:

<blockquote>
<verbatim>
<!-- urlfilter properties -->
<property>
  <name>urlfilter.class</name>
  <value>net.nutch.net.RegexURLFilter</value>
  <description>Name of the class used to filter URLs.</description>
</property>

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing default regular
  expressions used by RegexURLFilter.</description>
</property>
</verbatim>
</blockquote>

Now let's look at the crawler factories, which are a bit more complex.

Factory classes: ParserFactory, ProtocolFactory
-----------------------------------------------

> Class net.nutch.parse.ParserFactory
>   used by:
>   - net.nutch.db.WebDBInjector
>   - net.nutch.fetcher.Fetcher
>   - net.nutch.parse.ParserChecker
>
> Class net.nutch.protocol.ProtocolFactory
>   used by:
>   - net.nutch.fetcher.Fetcher
>   - net.nutch.parse.ParserChecker
>
> Class net.nutch.plugin.PluginRepository: used by all of the above

ParserFactory and ProtocolFactory are called directly from
net.nutch.fetcher.Fetcher, to get the appropriate Parser and Protocol
objects for a given content type and url. They both use an instance of
net.nutch.plugin.PluginRepository to find and load Java classes.

By default, nutch-default.xml tells PluginRepository to look for classes
in a directory called "plugins" somewhere on the Java classpath. Normally
you'll just use the one in your Nutch install directory.
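Since plugin lookup, like everything else, is driven by NutchConf
properties, note that your own classes can read values from the same
configuration files. A small sketch; the "my.crawler.max.depth" key is
hypothetical, something you would add to nutch-site.xml yourself:

<verbatim>
package org.example;

import net.nutch.util.NutchConf;

// Reads configuration the way the Nutch command classes do; the second
// argument to each getter is the default used when the key is absent.
public class ConfDemo {
  public static void main(String[] args) {
    // A built-in key, as seen in WebDBInjector above:
    int interval = NutchConf.getInt("db.default.fetch.interval", 30);

    // A made-up key of your own, defined in nutch-site.xml:
    int maxDepth = NutchConf.getInt("my.crawler.max.depth", 5);

    System.out.println("interval=" + interval + ", maxDepth=" + maxDepth);
  }
}
</verbatim>

Back to the factories: here is the nutch-default.xml property that points
PluginRepository at the plugin directories.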
<blockquote>
<verbatim>
<!-- plugin properties -->
<property>
  <name>plugin.folders</name>
  <value>plugins</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.</description>
</property>
</verbatim>
</blockquote>

Inside the plugin directory you will find a handful of sub-directories,
each containing a file called "plugin.xml" and one or more Java archive
(.jar) files. Directories include:

- parse-html
- parse-text
- parse-msword
- parse-pdf
- protocol-file
- protocol-ftp
- protocol-http

One directory, plus the "plugin.xml" and .jar file contents, constitutes
one "plugin". The XML file is a descriptor that is read by
PluginRepository to determine two main things: (a) what "extension point"
(Java interface) the plugin implements, and (b) how to load its contents.

Here is the plugin.xml file for "protocol-file":

<blockquote>
<verbatim>
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="protocol-file"
   name="File Protocol Plug-in"
   version="1.0.0"
   provider-name="nutch.org">

   <extension-point
      id="net.nutch.protocol.Protocol"
      name="Nutch Protocol"/>

   <runtime>
      <library name="protocol-file.jar">
         <export name="*"/>
      </library>
   </runtime>

   <extension id="net.nutch.protocol.file"
              name="FileProtocol"
              point="net.nutch.protocol.Protocol">
      <implementation id="net.nutch.protocol.file.File"
                      class="net.nutch.protocol.file.File"
                      protocolName="file"/>
   </extension>
</plugin>
</verbatim>
</blockquote>

Since the plugin is named "protocol-file", you probably guessed already
that this is a protocol handler for loading files on disk. But this
descriptor tells us -- and PluginRepository -- precisely what it does:

- the extension-point (Java interface) name is "net.nutch.protocol.Protocol"
- the protocolName is "file"

Thus, when Nutch sees a URL that starts with "file://", it will know to
call this plugin to fetch that page.

Look at the descriptors for "protocol-http" and "protocol-ftp". You
should see that the extension-point is exactly the same as for
protocol-file, but the protocolName is different: "http" and "ftp",
respectively.

Now let's examine the descriptor for parse-text:

<blockquote>
<verbatim>
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="parse-text"
   name="Text Parse Plug-in"
   version="1.0.0"
   provider-name="nutch.org">

   <extension-point
      id="net.nutch.parse.Parser"
      name="Nutch Content Parser"/>

   <runtime>
      <library name="parse-text.jar">
         <export name="*"/>
      </library>
   </runtime>

   <extension id="net.nutch.parse.text"
              name="TextParse"
              point="net.nutch.parse.Parser">
      <implementation id="net.nutch.parse.text.TextParser"
                      class="net.nutch.parse.text.TextParser"
                      contentType="text/plain"
                      pathSuffix="txt"/>
   </extension>
</plugin>
</verbatim>
</blockquote>

Note that the extension-point is now net.nutch.parse.Parser. And this
time, the <verbatim><extension><implementation></verbatim> element
doesn't specify a protocolName. Instead, we see "contentType" and
"pathSuffix".

So now we see how PluginRepository chooses which plugin to use for a
given task:

1. It finds the set of plugins that implement a certain extension-point.
2. Then, from that set, it finds one that works for the content at hand
   (protocolName, contentType, or pathSuffix).

Look at the descriptor for parse-html. You'll see that it follows these
rules.
It implements the same extension-point as parse-text
(net.nutch.parse.Parser), but it has different values for contentType and
pathSuffix:

<verbatim>
contentType="text/html"
pathSuffix=""
</verbatim>

This entry looks a bit strange with its empty pathSuffix value. But that
just means that this plugin doesn't match any pathSuffix at all. So,
parse-html is only used when we fetch remote URLs, not for anything
residing on the local filesystem.

Summary: Nutch crawler extension points
---------------------------------------

The main ways to configure the Nutch crawler are as follows:

1. Configuration files. Default values are in nutch-default.xml, and you
   should override them in nutch-site.xml.

2. URLFilter interface. By default, the class net.nutch.net.RegexURLFilter
   is used, which reads regular expression patterns from
   regex-urlfilter.txt. So, you can:
   - edit that file to tune its behavior, or
   - write a new class that implements net.nutch.net.URLFilter, and
     change nutch-site.xml to use it (see the sketch below).

3. Protocol interface. To add support for a new protocol, write or add a
   plugin to the "plugins" directory. To change protocol behavior, modify
   the appropriate plugin.

4. Parser interface. As for Protocol, you should add/create a plugin for
   any new content types. Otherwise, you will need to replace the
   appropriate plugin if you want to modify its behavior.

5. If you need to make other changes, refer to our discussion of Fetcher
   and FetchListTool. Consider subclassing these classes, overriding the
   appropriate method, then calling your class from the "nutch" script
   using its full class name.
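To make extension point 2 concrete, here is a hedged skeleton of a custom
URL filter. It assumes the net.nutch.net.URLFilter interface exposes a
single filter(String) method that returns the URL to keep it (possibly
rewritten) or null to reject it, as RegexURLFilter does; check the
interface in your source tree before relying on this:

<verbatim>
package org.example;  // hypothetical package and class name

import java.net.MalformedURLException;
import java.net.URL;

import net.nutch.net.URLFilter;

// Sketch of a URLFilter that only keeps URLs on a single host. Activate
// it by setting "urlfilter.class" to org.example.SingleHostURLFilter in
// nutch-site.xml (overriding the nutch-default.xml value shown earlier).
public class SingleHostURLFilter implements URLFilter {

  private static final String ALLOWED_HOST = "www.example.com"; // example value

  public String filter(String urlString) {
    try {
      URL url = new URL(urlString);
      // Keep the URL unchanged if it is on the allowed host, drop it otherwise.
      return ALLOWED_HOST.equals(url.getHost()) ? urlString : null;
    } catch (MalformedURLException e) {
      return null; // unparseable URLs are rejected
    }
  }
}
</verbatim>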
