Re: Crawling a ccTLD

Mauro Vignati Mon, 23 Mar 2009 02:24:50 -0700

Hi Julien,
Many thanks for your help! Your answer will save me a lot of hours of
research and testing. I have found the plugin and am testing it.
I have just one more question. Because I am really a noob, I have never
installed a plugin in Nutch. If I have understood correctly,
I need the following 3 files:


In the "conf" folder I have to add a "domain-urlfilter.txt": this file declares
which TLDs will be crawled/fetched (I mean, URLs that don't match the TLD are
not fetched).

In the "plugins" folder I have to add a folder called "urlfilter-domain" with 2
files inside: "DomainURLFilter.jar" and "plugin.xml". For these 2 files I
don't have to change anything in the configuration.

Am I right or should I change something?

Thanks again for your help!
Warm regards,
Mauro


> Hi Mauro,
>
> Have a look at the domain filter plugin in the SVN version of the code. It
> will allow you to filter based on the TLD.
>
> HTH
>
> Julien
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
>
> 2009/3/19 Mauro Vignati <vig...@gmail.com>
>
> > Hi,
> > I'm testing Nutch and until now everything works fine (ok, some hours spent
> > reading, testing, testing and testing, but that's normal).
> > I have a noob question: I have to crawl websites only within a ccTLD.
> >
> > In the crawl-urlfilter.txt, should I write:
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*.ch/
> >
> >
> > or so
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*ch/
> >
> >
> > The difference is the dot before the "ch" ccTLD. I mean, is the dot before
> > the bracket already dividing the ccTLD and the name (or the root and a
> > subdomain), or should I add one like in the first example? In the
> > installation guide I can see:
> >
> > +^http://([a-z0-9]*\.)*apache.org/
> >
> > Is this crawling every subdomain of apache.org (xxx.apache.org) or is it
> > crawling apache.org?
> >
> > Many thanks for any help
> > Mauro
> >
>
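
For reference, the two candidate rules from the original question can be
compared directly. The following is only a minimal sketch in Python, not
Nutch's own code: it assumes the filter applies each rule as an anchored
regular-expression search on the full URL, and the names accepts,
WITH_EXTRA_DOT, WITHOUT_EXTRA_DOT, APACHE_RULE and the sample URLs are
hypothetical, chosen for illustration. Nutch itself runs on Java regular
expressions, but the constructs used here behave the same way in both.

```python
import re

# The candidate rules from the question, without the leading "+"
# (in the filter file the "+" only marks a rule as "accept").
WITH_EXTRA_DOT = re.compile(r"^http://([a-z0-9]*\.)*.ch/")
WITHOUT_EXTRA_DOT = re.compile(r"^http://([a-z0-9]*\.)*ch/")
APACHE_RULE = re.compile(r"^http://([a-z0-9]*\.)*apache.org/")


def accepts(rule, url):
    """Return True if the rule matches the URL (search semantics)."""
    return rule.search(url) is not None


# Hypothetical sample URLs, chosen to show where the rules differ.
SAMPLE_URLS = [
    "http://example.ch/",
    "http://www.example.ch/",
    "http://www.example.com/",
    "http://my-site.ch/",
]

if __name__ == "__main__":
    for url in SAMPLE_URLS:
        print(url,
              "extra dot:", accepts(WITH_EXTRA_DOT, url),
              "no extra dot:", accepts(WITHOUT_EXTRA_DOT, url))
    for url in ["http://apache.org/", "http://lucene.apache.org/"]:
        print(url, "apache rule:", accepts(APACHE_RULE, url))
```

In this sketch the rule without the extra dot accepts http://example.ch/
and http://www.example.ch/, because the group ([a-z0-9]*\.)* already
consumes the dot in front of "ch". The rule with the extra dot rejects both:
the unescaped "." demands one additional character between the last dot and
"ch". The apache.org rule accepts both the bare domain and its subdomains,
since the group may repeat zero times. Note that [a-z0-9] does not include
the hyphen, so a host such as my-site.ch is rejected by either rule.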
