Hi Mauro,

Have a look at the domain filter plugin in the SVN version of the code. It
will allow you to filter based on the TLD.
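
On the regex question itself, a quick check outside Nutch may help. The sketch below is only an approximation: it uses Python's re module rather than the Java regex engine Nutch uses (these simple patterns should behave the same in both), it assumes the filter does an unanchored search, and the sample URLs and the accepts() helper are made up for illustration.

```python
import re

# Patterns copied from the thread, without the leading "+" (in the filter
# file that character only marks the rule as "accept").
WITH_DOT = r"^http://([a-z0-9]*\.)*.ch/"
WITHOUT_DOT = r"^http://([a-z0-9]*\.)*ch/"
APACHE = r"^http://([a-z0-9]*\.)*apache.org/"


def accepts(pattern, url):
    """Return True if the pattern is found in the URL (assumed search semantics)."""
    return re.search(pattern, url) is not None


# Hypothetical sample URLs, chosen only for illustration.
SAMPLES = [
    "http://www.example.ch/",
    "http://example.ch/",
    "http://example.com/",
    "http://my-site.ch/",
    "http://apache.org/",
    "http://xxx.apache.org/",
]

if __name__ == "__main__":
    rules = [("with dot", WITH_DOT), ("without dot", WITHOUT_DOT), ("apache.org", APACHE)]
    for name, pattern in rules:
        for url in SAMPLES:
            verdict = "accept" if accepts(pattern, url) else "reject"
            print(f"{name:12} {url:26} {verdict}")
```

In this sketch the form without the extra dot accepts http://www.example.ch/ and http://example.ch/, while the form with the extra dot rejects both: every repetition of the group already ends in a dot, so the additional "." has to consume one more character before "ch/". The apache.org rule accepts both apache.org and xxx.apache.org, because the group may repeat zero times. Also note that [a-z0-9] does not include "-", so a host such as my-site.ch is rejected by either form.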

HTH

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com


2009/3/19 Mauro Vignati <vig...@gmail.com>

> Hi,
> I'm testing Nutch and so far everything works fine (OK, some hours spent
> reading, testing, testing and testing, but that's normal).
> I have a noob question: I have to crawl websites only within a ccTLD.
>
> In crawl-urlfilter.txt, should I write it like this:
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*.ch/
>
>
> or like this:
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*ch/
>
>
> The difference is the dot before the "ch" ccTLD. I mean, is the dot before
> the bracket already separating the ccTLD from the name (or the root from a
> subdomain), or should I add one as in the first example? In the installation
> guide I can see:
>
> +^http://([a-z0-9]*\.)*apache.org/
>
> Is this crawling every subdomain of apache.org (xxx.apache.org), or is it
> crawling apache.org itself?
>
> Many thanks for any help
> Mauro
>
