Hi Bogdan,

thanks for your reply. What does an entry in regex-urlfilter.txt look like?
Does it look like this?

+^http://([a-z0-9]*\.)*domain.com/

And what happens when I do a recrawl? Do I then need to have all domains in regex-urlfilter.txt, or just the new ones?

Best regards,
RON

----- Original Message -----
From: "Bogdan Kecman" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, July 18, 2006 12:32 PM
Subject: RE: Crawl injected Domains only

> Hello List,
>
> I have a newbie question and I hope that someone can help me.
> I am doing a whole-web crawl, but I don't want to leave the injected
> domains --> no following of links to external domains.
>
> How can I do that?

Hi,

I haven't seen any option to do that in my experience with Nutch. The way I do it is that, at the same time I generate the list of URLs to crawl, I also change regex-urlfilter.txt.

Note that this will slow down the search a bit, since for every URL Nutch will go through that file.

Hope that helps,
Bogdan
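
As a rough illustration of the approach described above (producing the filter rules at the same time as the URL list), here is a minimal Python sketch. It is not part of Nutch: the function name build_filter_rules and the example URLs are made up for illustration, and the closing "-." rule (reject everything that matched no earlier rule) is an assumption about how the filter file should end. The generated lines follow the same shape as the pattern quoted in the question, with the dots in the domain escaped.

```python
import re
from urllib.parse import urlparse


def build_filter_rules(seed_urls):
    """Build regex-urlfilter.txt style lines that accept only the seed domains.

    Hypothetical helper for illustration; not part of Nutch.
    """
    domains = set()
    for url in seed_urls:
        host = urlparse(url).hostname
        if not host:
            continue  # skip entries that have no scheme/host part
        # Drop a leading "www." so that other subdomains are accepted too.
        domains.add(host[4:] if host.startswith("www.") else host)

    rules = []
    for domain in sorted(domains):
        # Same shape as the pattern in the question (http only),
        # with the literal dots of the domain escaped.
        rules.append(r"+^http://([a-z0-9]*\.)*" + re.escape(domain) + "/")

    # Assumption: a final catch-all line that rejects every other URL.
    rules.append("-.")
    return rules


if __name__ == "__main__":
    seeds = [
        "http://www.example.com/index.html",
        "http://example.com/about",
        "http://news.example.org/",
    ]
    print("\n".join(build_filter_rules(seeds)))
```

The printed lines could then be written into regex-urlfilter.txt next to the seed list. Because the rules are rebuilt from the complete seed list on every run, the file always reflects whatever list was passed in.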
