Re: Some sites are indexed even if they are not included in crawl-urlfilter.txt

jianguo cai Mon, 22 Dec 2008 23:09:50 -0800

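I think the rule +^http://([a-z0-9]*\.).*\.be/ is looser than intended,
which may be related to URLs like www.adobe.com/be/ getting through: the
regex doesn't necessarily require the host name to end with be, because
.* can also match past the first slash into the path. To enforce that you
would have to anchor the pattern, for example by adding $ at the end,
though I'm not quite sure.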
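
A quick way to check what the rule really accepts is to run it against a
few URLs. The sketch below is only an approximation: it uses Python's
re.search as a stand-in for the Java regex search that I assume Nutch's
URL filter performs, and both HOST_ANCHORED and the example.com URL are
made up for illustration and are not from the original configuration.

```python
import re

# Pattern from crawl-urlfilter.txt, without the leading "+" action marker.
ORIGINAL = re.compile(r"^http://([a-z0-9]*\.).*\.be/")

# Hypothetical tighter pattern (an assumption, not from the thread):
# only host labels are allowed before the first slash, and the last
# label must be "be".
HOST_ANCHORED = re.compile(r"^http://([a-z0-9-]+\.)+be/")


def accepted(pattern, url):
    """Return True if the pattern is found in the URL (search semantics)."""
    return pattern.search(url) is not None


URLS = [
    "http://www.adobe.be/",             # .be host
    "http://www.adobe.com/be/",         # redirect target from the thread
    "http://www.example.com/page.be/",  # hypothetical: ".be/" only in the path
]

for url in URLS:
    print(url, accepted(ORIGINAL, url), accepted(HOST_ANCHORED, url))
```

In this check the redirect target itself is not accepted by either
pattern, since it contains /be/ rather than .be/, so the redirect
handling may be worth looking at separately. The made-up URL with .be/
in its path, however, is accepted by the original rule and rejected by
the host-anchored one.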


feedsky.51vip.biz


2008/11/26 ML mail <[email protected]>

> Hello,
>
> I am using Nutch 0.9 to index all domain names ending with a specific top
> level domain (actually .be) and for this purpose I have configured the
> following line in crawl-urlfilter.txt:
>
> +^http://([a-z0-9]*\.).*\.be/
>
> all the rest is skipped with the "-." filter at the end of the file.
>
> Strangely, I end up with a small but still increasing number of websites
> not ending with .be in the index...
>
> I guess this has something to do with redirections, for example I noticed
> that www.adobe.be redirects to www.adobe.com/be/ and so www.adobe.com gets
> indexed by Nutch.
>
> Now my question is: how can I prevent the indexing of all these
> redirected URLs which do not redirect to a .be domain?
>
> Many thanks in advance
> Regards
>
>
>
>
