Hi,

URLFilter is part of Nutch's plugin system. Its implementations limit the URLs that Nutch attempts to fetch [0]. For an overview of the Nutch plugin system, see [1]; for how to write a plugin, see [2]. To choose which plugins are loaded, set the plugin.includes property in nutch-site.xml so that it includes the plugins you want. The default value of that property is in nutch-default.xml.
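As a rough illustration of the filtering contract described above, here is a minimal, self-contained Java sketch. It does not use Nutch itself: SimpleUrlFilter is a stand-in that assumes the same shape as Nutch's URLFilter.filter(String) (return the URL to keep it, null to drop it), and HostSuffixFilter, HTTP_ONLY, applyAll and the example URLs are hypothetical names made up for this sketch. In Nutch, RegexURLFilter and PrefixURLFilter are plugin implementations of this kind, and URLFilters runs the active ones in turn.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UrlFilterSketch {

    /** Stand-in for Nutch's URLFilter contract (assumption): return the URL to keep it, null to drop it. */
    interface SimpleUrlFilter {
        String filter(String urlString);
    }

    /** Hypothetical filter: keeps only URLs whose host equals, or is a subdomain of, the given suffix. */
    static class HostSuffixFilter implements SimpleUrlFilter {
        private final String suffix;

        HostSuffixFilter(String suffix) {
            this.suffix = suffix;
        }

        @Override
        public String filter(String urlString) {
            try {
                String host = new URI(urlString).getHost();
                boolean match = host != null
                        && (host.equals(suffix) || host.endsWith("." + suffix));
                return match ? urlString : null;
            } catch (URISyntaxException e) {
                return null; // unparseable URLs are rejected
            }
        }
    }

    /** Hypothetical filter: keeps only http and https URLs. */
    static final SimpleUrlFilter HTTP_ONLY =
            u -> (u.startsWith("http://") || u.startsWith("https://")) ? u : null;

    /** Runs the filters in order; the first one that returns null rejects the URL. */
    static String applyAll(String url, SimpleUrlFilter... filters) {
        for (SimpleUrlFilter f : filters) {
            if (url == null) {
                return null;
            }
            url = f.filter(url);
        }
        return url;
    }

    public static void main(String[] args) {
        SimpleUrlFilter host = new HostSuffixFilter("apache.org");
        System.out.println(applyAll("http://nutch.apache.org/", host, HTTP_ONLY)); // kept
        System.out.println(applyAll("http://example.com/", host, HTTP_ONLY));      // null: wrong host
        System.out.println(applyAll("ftp://apache.org/x", host, HTTP_ONLY));       // null: wrong scheme
    }
}
```

A real plugin would implement org.apache.nutch.net.URLFilter instead of the stand-in interface and be enabled through plugin.includes; see [2] for the packaging steps.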
For politeness, look at the fetcher properties in nutch-default.xml. For example, fetcher.server.delay controls the number of seconds the fetcher waits between successive requests to the same server, and fetcher.threads.fetch controls the number of FetcherThreads the fetcher uses. There are others in the same file.

[0] http://wiki.apache.org/nutch/AboutPlugins
[1] http://wiki.apache.org/nutch/PluginCentral
[2] http://wiki.apache.org/nutch/WritingPluginExample

On Mon, Apr 22, 2013 at 8:06 PM, naveen shukla <[email protected]> wrote:
> Hi All,
>
> I am a developer want to write a plugin using JSOUP in nutch for parsing
> the html file. But to get better feel of it i would need to understand the
> whole functionality.
>
> What i perceived is URLFilter, URLFilterChecker and URLFilters.java but i
> get confused when i see the following files RegexURLFilter, PrefixURLFilter.
>
> Please can anybody tell me exactly which java files are handling the URL
> filtering and politeness of the crawler.
>
> Awaiting for positive reply.
>
> Thanks in advance.
>
> From:
>
> Naveen Shukla
>

--
Don't Grow Old, Grow Up... :-)

