Hi Bogdan,

Thanks for your reply. What does an entry in regex-urlfilter.txt look like?

Does it look like this: +^http://([a-z0-9]*\.)*domain.com/

And what happens when I do a recrawl? Do I then need to have all domains in
regex-urlfilter.txt, or just the new ones?

Best regards
RON

----- Original Message ----- 
From: "Bogdan Kecman" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, July 18, 2006 12:32 PM
Subject: RE: Crawl injected Domains only

> Hello List,
> 
> I have a newbie question and I hope that someone can help me. 
> I am doing a whole-web crawl, but I don't want to leave the injected
> domains, i.e. links to external domains should not be followed.
> 
> How can I do that?

Hi,
I haven't seen any option to do that in my experience with
Nutch. The way I do it is that, at the same time as I generate the
list of URLs to crawl, I also update regex-urlfilter.txt.
Note that this will slow things down a bit, because
Nutch goes through that file for every URL.
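
A minimal sketch of that approach, in Python, assuming the seed URLs are
available as a plain list. The function name filter_rules and the example
hosts are hypothetical. The pattern shape is the one quoted earlier in this
thread, with the dots in the domain escaped, and the closing "-." line is
assumed to reject every URL that no earlier rule accepted:

```python
from urllib.parse import urlparse


def filter_rules(seed_urls):
    """Return regex-urlfilter.txt lines that accept only the seed domains.

    Hypothetical helper: one '+' rule per distinct domain, followed by a
    final '-.' rule (assumed here to reject everything else).
    """
    domains = set()
    for url in seed_urls:
        host = urlparse(url).hostname
        if not host:
            continue  # skip entries that are not absolute URLs
        # Drop a leading "www." so the subdomain prefix in the rule covers it.
        domains.add(host[4:] if host.startswith("www.") else host)

    rules = []
    for domain in sorted(domains):
        escaped = domain.replace(".", r"\.")  # match literal dots only
        rules.append(r"+^http://([a-z0-9]*\.)*" + escaped + "/")
    rules.append("-.")
    return rules


if __name__ == "__main__":
    seeds = ["http://www.example.com/", "http://news.example.org/index.html"]
    print("\n".join(filter_rules(seeds)))
```

The printed lines would be written to regex-urlfilter.txt each time the seed
list is regenerated, so the filter and the injected URLs stay in step.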

Hope that helps
Bogdan
