Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by MattKangas:
http://wiki.apache.org/nutch/DissectingTheNutchCrawler

------------------------------------------------------------------------------
  The main ways to configure the Nutch crawler are as follows:

-  1. Configuration files. Default values are in nutch-default.xml, and you should override them in nutch-site.xml. [[BR]][[BR]]
+  1. Configuration files. Default values are in nutch-default.xml, and you should override them in nutch-site.xml.
   1. URLFilter interface. By default, the class {{{net.nutch.net.RegexURLFilter}}} is used, which reads regular expression patterns from regex-urlfilter.txt. So, you can:
     * Edit that file to tune its behavior
-    * Or, write a new class that implements {{{net.nutch.net.URLFilter}}}, and change nutch-site.xml to use it. [[BR]][[BR]]
+    * Or, write a new class that implements {{{net.nutch.net.URLFilter}}}, and change nutch-site.xml to use it.
-  1. Protocol interface. To add support for a new protocol, write or add a plugin to the "plugins" directory. To change protocol behavior, modify the approprite plugin. [[BR]][[BR]]
+  1. Protocol interface. To add support for a new protocol, write or add a plugin to the "plugins" directory. To change protocol behavior, modify the approprite plugin.
-  1. Parser interface. As for Protocol, you should add/create a plugin for any new content-types. Otherwise, you will need to replace the appropriate plugin if you want to modify its behavior. [[BR]][[BR]]
+  1. Parser interface. As for Protocol, you should add/create a plugin for any new content-types. Otherwise, you will need to replace the appropriate plugin if you want to modify its behavior.
   1. If you need to make other changes, refer to our discussion of '''Fetcher''' and '''FetchListTool'''. Consider subclassing these classes, overriding the appropriate method, then calling your class from the "nutch" script using the full class path.
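
The URLFilter item in the page text above describes two routes: tune the regular-expression file, or write your own filter class. The following is only a minimal, self-contained sketch of what such a class might look like. It does not use the real Nutch classes: the nested UrlFilter interface is a hypothetical stand-in for {{{net.nutch.net.URLFilter}}}, and both its contract (return the URL to keep it, null to drop it) and the signed-rule format ('+' or '-' followed by a regular expression, first match wins) are assumptions made for illustration. Check them against the Nutch sources before relying on them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {

    /**
     * Hypothetical stand-in for the Nutch filter interface.
     * Assumed contract: return the URL to keep it, or null to drop it.
     */
    interface UrlFilter {
        String filter(String url);
    }

    /** Applies signed regex rules in order; the first rule that matches decides. */
    static final class OrderedRegexFilter implements UrlFilter {
        private final List<Pattern> patterns = new ArrayList<>();
        private final List<Boolean> accept = new ArrayList<>();

        /** Each rule is '+' or '-' followed by a regular expression (assumed format). */
        OrderedRegexFilter(List<String> rules) {
            for (String rule : rules) {
                if (rule.isEmpty() || (rule.charAt(0) != '+' && rule.charAt(0) != '-')) {
                    throw new IllegalArgumentException("Rule must start with + or -: " + rule);
                }
                accept.add(rule.charAt(0) == '+');
                patterns.add(Pattern.compile(rule.substring(1)));
            }
        }

        @Override
        public String filter(String url) {
            for (int i = 0; i < patterns.size(); i++) {
                // find() matches anywhere in the URL unless the rule is anchored.
                if (patterns.get(i).matcher(url).find()) {
                    return accept.get(i) ? url : null;
                }
            }
            return null; // no rule matched: drop the URL
        }
    }

    public static void main(String[] args) {
        // example.org and other.test are placeholder hosts, not taken from the page.
        UrlFilter filter = new OrderedRegexFilter(List.of(
                "-\\.(gif|jpg|png)$",
                "+^http://([a-z0-9]*\\.)*example\\.org/"));
        System.out.println(filter.filter("http://www.example.org/index.html"));
        System.out.println(filter.filter("http://www.example.org/logo.png"));
        System.out.println(filter.filter("http://other.test/"));
    }
}
```

In this sketch rule order matters: placing the image-suffix rejection first means an image on an otherwise accepted host is still dropped. A real replacement filter would implement the actual Nutch interface and be selected through nutch-site.xml, as the page describes.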
_______________________________________________
Nutch-cvs mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-cvs
