On Thu, Jul 30, 2009 at 9:15 PM, <[email protected]> wrote:
> I would like to know how can I modify nutch code to exclude external links
> with certain extensions. For example, if have in urls mydomain.com and my
> domain.com has a lot of links like mydomain.com/mylink.shtml, then I want
> nutch not to fetch(crawl) these kind of urls at all.

Can't you do this with the existing RegexURLFilter plugin? Make sure urlfilter-regex is listed in plugin.includes, and that the urlfilter.regex.file property is set to a file (probably regex-urlfilter.txt). Then you can list the extensions you want to skip in that file.

--
http://www.linkedin.com/in/paultomblin
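
For illustration, a rule file along these lines should skip the .shtml links from the question. This is only a sketch: it assumes the usual regex-urlfilter.txt format, where each rule is a '+' (accept) or '-' (reject) followed by a regular expression and the first matching rule wins. The .shtml extension is taken from the example above; everything else in your real file stays as it is.

```
# regex-urlfilter.txt (sketch, not a complete file)

# Reject any URL that ends in .shtml. This must come before the
# catch-all accept rule below, because the first match decides.
-\.shtml$

# Accept everything that no earlier rule rejected.
+.
```

Note that the `$` anchor means a URL such as mylink.shtml?x=1 would not match this rule; if those should be skipped too, the pattern would need to allow for a trailing query string.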
