Some sites are indexed even if they are not included in crawl-urlfilter.txt

ML mail Tue, 25 Nov 2008 14:06:38 -0800

Hello,

I am using Nutch 0.9 to index all domain names ending with a specific top-level 
domain (in this case .be). For this purpose I have configured the following line 
in crawl-urlfilter.txt:


+^http://([a-z0-9]*\.).*\.be/

Everything else is skipped by the "-." rule at the end of the file.
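
As a quick sanity check of what the rule above accepts, here is a minimal sketch in Python. It assumes that the filter looks for the pattern anywhere in the URL (search semantics) and that Python's regex behaviour matches Java's for these constructs; both are assumptions, not something verified against Nutch 0.9. The STRICT pattern and the example.com URL are hypothetical, added only for comparison.

```python
import re

# The rule from crawl-urlfilter.txt, without the leading "+" (accept) marker.
ORIGINAL = re.compile(r"^http://([a-z0-9]*\.).*\.be/")

# Hypothetical tighter variant (not from the original post): only host
# labels may appear between "http://" and the final "be/".
STRICT = re.compile(r"^http://([a-z0-9-]+\.)+be/")


def accepted(pattern, url):
    """Return True if the pattern is found in the URL (search semantics)."""
    return pattern.search(url) is not None


urls = [
    "http://www.adobe.be/",             # .be host
    "http://www.adobe.com/be/",         # redirect target from the post
    "http://www.example.com/page.be/",  # hypothetical: ".be/" only in the path
]

for url in urls:
    print(url, accepted(ORIGINAL, url), accepted(STRICT, url))
```

Under these assumptions the original rule rejects http://www.adobe.com/be/, which is consistent with the redirect explanation rather than a filter mismatch. It does, however, accept a non-.be host whose path happens to contain ".be/", because ".*" can run past the host name.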

Strangely, I end up with a small but steadily growing number of websites not 
ending with .be in the index.

I guess this has something to do with redirects. For example, I noticed that 
www.adobe.be redirects to www.adobe.com/be/, so www.adobe.com gets indexed 
by Nutch.

Now my question is: how can I prevent Nutch from indexing redirect targets 
that are not on a .be domain?

Many thanks in advance
Regards
