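I don't think the line +^http://([a-z0-9]*\.).*\.be/ restricts the crawl to hosts ending in .be. The regex doesn't require \.be/ to come at the end of the host name: .* can run past the first slash, so a URL on any domain passes as long as ".be/" appears somewhere later in it. Adding $ at the end wouldn't fix that, since it would anchor the end of the whole URL rather than the host; keeping the match inside the host name, for example with [^/]* in place of .*, looks like the safer change, though I'm not quite sure it accounts for www.adobe.com/be/.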
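One way to see what the filter line accepts is to try the pattern against a few URLs. The sketch below uses Python's re module rather than Nutch itself, and assumes the filter looks for the pattern with "find" semantics (which makes no difference here, because the pattern starts with ^). The example.com and example.be URLs, the HOST_ONLY pattern and the accepted() helper are made up for illustration; they are not taken from the original configuration.

```python
import re

# Pattern from crawl-urlfilter.txt, without the leading "+"
# (the "+" only tells the filter to accept matching URLs).
ORIGINAL = re.compile(r"^http://([a-z0-9]*\.).*\.be/")

# Hypothetical tighter variant: [^/]* cannot cross a slash,
# so "\.be/" has to close the host name.
HOST_ONLY = re.compile(r"^http://[^/]*\.be/")


def accepted(pattern, url):
    """Return True if the pattern is found in the URL."""
    return pattern.search(url) is not None


urls = [
    "http://www.example.be/index.html",  # .be host: both patterns accept it
    "http://www.example.com/page.be/",   # .com host with ".be/" in the path
    "http://www.adobe.com/be/",          # contains "/be/", but never ".be/"
]

for url in urls:
    print(url, accepted(ORIGINAL, url), accepted(HOST_ONLY, url))
```

In this check the original pattern accepts the second URL even though the host is a .com, while the tighter variant rejects it. Neither pattern accepts http://www.adobe.com/be/, so if that URL ends up in the index, the filter regex alone may not be the whole explanation.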
feedsky.51vip.biz

2008/11/26 ML mail <[email protected]>

> Hello,
>
> I am using Nutch 0.9 to index all domain names ending with a specific top
> level domain (actually .be) and for this purpose I have configured the
> following line in crawl-urlfilter.txt:
>
> +^http://([a-z0-9]*\.).*\.be/
>
> all the rest is skipped with the "-." filter at the end of the file.
>
> Strangely I end up with a small but still increasing amount of websites not
> ending with .be in the index...
>
> I guess this has something to do with redirections, for example I noticed
> that www.adobe.be redirects to www.adobe.com/be/ and so www.adobe.com gets
> indexed by Nutch.
>
> Now my question is: how can I prevent indexing all these redirected URLs
> which do not redirect to a .be domain ?
>
> Many thanks in advance
> Regards
