Hi Julien,

Many thanks for your help! Your answer will save me many hours of research and testing. I have found the plugin and am testing it.

I have just one more question. Because I am really a noob, I have never installed a plugin in Nutch. If I have understood correctly, I need the following 3 files:
In "conf" I have to add a "domain-urlfilter.txt": this file declares which TLDs will be crawled/fetched (I mean, URLs that don't match the TLD are not fetched).

In "plugins" I have to add a folder called "urlfilter-domain" with 2 files inside: "DomainURLFilter.jar" and "plugin.xml". For these 2 files I don't have to change anything in the configuration.

Am I right, or should I change something?

Thanks again for your help!

Warm regards
Mauro

> Hi Mauro,
>
> Have a look at the domain filter plugin in the SVN version of the code. It
> will allow you to filter based on the TLD.
>
> HTH
>
> Julien
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
>
> 2009/3/19 Mauro Vignati <vig...@gmail.com>
>
> > Hi,
> > I'm testing Nutch and until now everything works fine (ok, some hours
> > spent in reading, testing, testing and testing, but that's normal).
> > I have a noob question: I have to crawl websites only within a ccTLD.
> >
> > In the crawl-urlfilter.txt should I write it like this:
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*.ch/
> >
> > or like this:
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*ch/
> >
> > The difference is the dot before the "ch" ccTLD. I mean, is the dot
> > before the bracket already dividing the ccTLD and the name (or the root
> > and a subdomain), or should I add one like in the first example? In the
> > installation guide I can see:
> >
> > +^http://([a-z0-9]*\.)*apache.org/
> >
> > Is this crawling every subdomain of apache.org (xxx.apache.org) or is
> > it crawling apache.org only?
> >
> > Many thanks for any help
> > Mauro
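
For the two crawl-urlfilter.txt patterns in the quoted question, a quick way to see the difference is to try them against a few sample URLs. The sketch below is only an illustration: it uses Python's `re` module rather than Nutch itself (Nutch evaluates these rules with Java regular expressions, and the assumption here is that both engines treat these particular patterns the same way). The leading `+` of each rule is dropped because it is the filter's accept marker, not part of the regex. The sample URLs and the helper name `accepts` are made up for the test.

```python
import re

# The two candidate rules from the question, without the leading "+".
WITH_DOT = re.compile(r"^http://([a-z0-9]*\.)*.ch/")     # extra unescaped dot
WITHOUT_DOT = re.compile(r"^http://([a-z0-9]*\.)*ch/")   # no extra dot
# The rule quoted from the installation guide.
APACHE = re.compile(r"^http://([a-z0-9]*\.)*apache.org/")


def accepts(pattern, url):
    """Return True if the pattern finds a match in the URL.

    Hypothetical helper for this sketch, not a Nutch function.
    """
    return pattern.search(url) is not None


# Made-up sample URLs, chosen only to exercise the patterns.
SAMPLES = [
    "http://example.ch/",
    "http://www.example.ch/",
    "http://www.example.com/",
    "http://www.xch/",
    "http://apache.org/",
    "http://lucene.apache.org/",
]

if __name__ == "__main__":
    for url in SAMPLES:
        print(
            url,
            "with_dot=%s" % accepts(WITH_DOT, url),
            "without_dot=%s" % accepts(WITHOUT_DOT, url),
            "apache=%s" % accepts(APACHE, url),
        )
```

In this sketch the group `([a-z0-9]*\.)*` already consumes the dot after each host label, so the variant without the extra dot accepts both `example.ch` and `www.example.ch`. The extra unescaped `.` in the first variant stands for "any one character", which makes it reject ordinary `.ch` hosts and accept odd ones such as `www.xch`. By the same reasoning the `apache.org` rule accepts both `apache.org` itself and its subdomains. Note that `[a-z0-9]` does not include `-`, so hyphenated host names would not match either variant as written.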