Hello,
I am using Nutch 0.9 to index all domain names under a specific top-level
domain (currently .be). For this purpose I have configured the following line
in crawl-urlfilter.txt:
+^http://([a-z0-9]*\.).*\.be/
Everything else is skipped by the "-." filter at the end of the file.
Strangely, I end up with a small but steadily growing number of websites not
ending in .be in the index.
I guess this has something to do with redirects. For example, I noticed that
www.adobe.be redirects to www.adobe.com/be/, and so www.adobe.com gets indexed
by Nutch.
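
Independently of redirects, the filter line itself can be checked against a few sample URLs. The following is only a minimal sketch in Python, assuming that Python's `re` behaves like the Java regex engine for these simple patterns and that the filter searches for the pattern in the URL (hence `re.search`). The sample URLs other than the adobe ones, the `accepts` helper, and the `TIGHTER` pattern are hypothetical illustrations, not part of the original configuration, and none of this addresses how redirects are handled.

```python
import re

# The pattern from crawl-urlfilter.txt, without the leading "+" action marker.
ORIGINAL = r"^http://([a-z0-9]*\.).*\.be/"

# Hypothetical tighter variant (an assumption, not from the original setup):
# it only allows host-name labels before "be/", so "/" cannot appear
# before the top-level domain.
TIGHTER = r"^http://([a-z0-9-]+\.)+be/"


def accepts(pattern: str, url: str) -> bool:
    """Return True if the pattern is found in the URL."""
    return re.search(pattern, url) is not None


# Sample URLs; the last two are made up for illustration.
SAMPLES = [
    "http://www.adobe.be/",
    "http://www.adobe.com/be/",
    "http://www.example.com/mirror.be/index.html",
    "http://example.be/",
]

if __name__ == "__main__":
    for url in SAMPLES:
        print(url, "original:", accepts(ORIGINAL, url),
              "tighter:", accepts(TIGHTER, url))
```

In this sketch the original pattern rejects http://www.adobe.com/be/, which would fit the guess that such URLs arrive through redirects rather than through the filter. Because `.*` can also match "/", the original pattern accepts a non-.be host whose path contains ".be/", and it rejects a bare http://example.be/ because it requires at least two dots before "be/".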
Now my question is: how can I prevent Nutch from indexing redirect targets
that are not on a .be domain?
Many thanks in advance
Regards